
VoxCPM
Next-Generation Tokenizer-Free TTS
Zero-Shot Voice Cloning Technology
VoxCPM is built on the MiniCPM-4 architecture and uses hierarchical language modeling to achieve tokenizer-free, end-to-end speech synthesis. Trained on 1.8 million hours of bilingual corpus, it supports context-aware speech generation and zero-shot voice cloning: with just 3-10 seconds of reference audio, it can replicate a speaker's timbre, accent, and emotional tone. It runs efficiently on consumer GPUs, generating speech 6 times faster than playback speed. VoxCPM also infers intonation style from the text content itself and supports cross-language synthesis between Chinese and English.
Experience VoxCPM Now
Online demonstration of tokenizer-free TTS technology and zero-shot voice cloning capabilities
Usage Guide
Zero-Shot Voice Cloning
- Upload 3-10 seconds of reference audio
- Input the target text
- Generate speech in the target speaker's voice
Context-Aware Generation
- No prompt audio required
- Infers intonation from text content
- Supports emotional expression and context adaptation
VoxCPM Technical Architecture
Deep dive into VoxCPM's core technical principles and innovative architectural design
Core Technical Innovations
MiniCPM-4 Backbone Network
Built on MiniCPM-4, a large language model optimized for edge deployment, as its core architecture. Hierarchical language modeling integrates text semantic understanding with speech feature extraction, enabling context-aware speech generation.
Tokenizer-Free End-to-End Architecture
Abandons the discrete speech tokenization used in traditional TTS pipelines and models directly in continuous speech space. An end-to-end diffusion-autoregressive architecture converts text to speech with high fidelity while preserving natural fluency.
FSQ Quantization Technology
Adopts Finite Scalar Quantization (FSQ) for efficient encoding of speech features, significantly reducing computational complexity and storage requirements while maintaining audio quality.
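To illustrate the core idea, here is a minimal FSQ sketch in PyTorch (illustrative only, not VoxCPM's actual codec; the level count is an assumption):
# Minimal sketch of Finite Scalar Quantization (FSQ); illustrative only,
# not VoxCPM's actual codec implementation.
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    half = (levels - 1) / 2
    z = torch.tanh(z) * half          # bound each dimension to (-half, half)
    z_q = torch.round(z)              # snap to the nearest of `levels` integer values
    return z + (z_q - z).detach()     # straight-through estimator keeps gradients flowing

# Each latent dimension now takes one of `levels` discrete values, so a
# codebook emerges without any learned vector-quantization lookup table.
codes = fsq_quantize(torch.randn(2, 4))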
Local Diffusion Transformer
Combines the strengths of diffusion models and the Transformer architecture: a local diffusion mechanism produces high-quality speech while keeping inference efficient, reaching a real-time factor (RTF) of 0.17.
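For reference, RTF is simply wall-clock generation time divided by the duration of the audio produced (a minimal sketch; the synthesize callable and the 16 kHz sample rate are assumptions):
# Real-time factor (RTF): generation time / duration of audio produced.
# RTF 0.17 means speech is generated about 6x faster than playback.
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 16000) -> float:
    start = time.perf_counter()
    audio = synthesize(text)                      # returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)   # < 1.0 means faster than real time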
Zero-Shot Voice Cloning
With just a small amount of reference audio (3-10 seconds), it can extract subtle features of the speaker's timbre, accent, and emotional tone, achieving high-fidelity voice cloning.
Technical Performance Metrics
On authoritative international benchmarks, VoxCPM excels across multiple key metrics
Technical Comparison Matrix
Comprehensive performance comparison between VoxCPM and mainstream TTS models
| Model | RTF | CER (%) | Similarity | Zero-Shot | Multilingual | Open Source |
|---|---|---|---|---|---|---|
| VoxCPM | 0.17 | 0.93 | 77.2% | ✓ | ✓ | ✓ |
| CosyVoice | 0.25 | 3.2 | 0.88 | | | |
| F5-TTS | 0.42 | 4.1 | 0.85 | | | |
| SparkTTS | 0.31 | 2.8 | 0.89 | | | |
[Chart: Competitor comparison of VoxCPM's tokenizer-free architecture against mainstream TTS models]
Core Capability Audio Demonstrations
Experience VoxCPM's exceptional performance in cross-language cloning, emotional expression, and context-aware generation
Cross-Language - EN→CN
English speaker voice cloned to Chinese speech
Cross-Language - CN→EN
Chinese speaker voice cloned to English speech
Emotion - Happy
Emotionally rich happy tone expression
Emotion - Sad
Emotionally rich sad tone expression
Context-Aware - News
Intelligent news broadcasting style
Context-Aware - Story
Intelligent storytelling style
Technical Deep Dive
Explore VoxCPM's technical details and access development resources and academic research
Academic Paper
Detailed technical principles, experimental results, and performance evaluation reports
Read Paper
Community Support
Join the developer community for technical support and discussions
Join Discussion
Technical Highlights
Application Scenarios
VoxCPM demonstrates exceptional performance across multiple domains, providing powerful support for innovative applications
Audiobook Production
Rapidly generate high-quality audiobooks with consistent voice
Language Learning
Personalized speech education with multilingual accent training
Content Creation
Professional voice solutions for video dubbing and podcast production
Accessibility Applications
Personalized reading experiences for visually impaired individuals
Quick Start
Get started with VoxCPM in just a few steps to deploy and experience tokenizer-free TTS technology
Environment Setup
First clone the repository and install dependencies (the GitHub URL reflects the OpenBMB release; adjust if your source differs):
$ git clone https://github.com/OpenBMB/VoxCPM.git
$ cd VoxCPM
$ pip install -r requirements.txt
Model Download
Download pre-trained model weights:
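For example, using the HuggingFace CLI (the model id openbmb/VoxCPM-0.5B is an assumption; substitute the published one):
$ huggingface-cli download openbmb/VoxCPM-0.5B --local-dir ./checkpoints/VoxCPM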
Quick Usage
Use a Python script for speech synthesis:
# Initialize the model (import path assumed from the package name)
from voxcpm import VoxCPM

model = VoxCPM("./checkpoints/VoxCPM")

# Speech synthesis: clone the voice from a short reference clip
audio = model.synthesize(
    text="Hello, this is a VoxCPM speech synthesis demo",
    reference_audio="path/to/reference.wav",
)
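The same call also covers context-aware generation without a reference speaker, and the output can be written to disk (a sketch assuming reference_audio is optional, as the usage guide above implies, and a 16 kHz output rate):
# Context-aware generation: intonation is inferred from the text alone
audio = model.synthesize(text="Breaking news: VoxCPM is now open source.")

# Save the result (sample rate assumed; check the model's actual output rate)
import soundfile as sf
sf.write("output.wav", audio, 16000)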
System Requirements
- Python 3.8+
- PyTorch 1.13.0+
- CUDA 11.6+ (RTX 4090 or higher recommended)
- RAM: 16 GB+
- VRAM: 12 GB+
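A quick way to verify the environment before deploying (a minimal check using PyTorch's standard CUDA utilities):
# Environment check: confirm PyTorch sees a CUDA GPU with enough VRAM
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 2**30:.1f} GiB")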
Frequently Asked Questions
Common questions and answers about VoxCPM technology and usage
What is tokenizer-free TTS technology?
Tokenizer-free TTS abandons the traditional step of chopping speech into discrete tokens and instead models directly in continuous speech space. The approach is similar to composing music in its original continuous space rather than reducing it to MIDI notes and rebuilding it, which preserves the natural fluency and expressiveness of speech.
How fast is VoxCPM, and what hardware does it need?
VoxCPM runs efficiently on consumer GPUs (such as the RTX 4090), achieving a real-time factor of 0.17, i.e. generating speech about 6 times faster than playback. With 500 million parameters, it is a lightweight design for this field, balancing performance and efficiency.
How does zero-shot voice cloning work?
With just 3-10 seconds of reference audio, VoxCPM extracts and replicates the speaker's voice characteristics, including accent and nuances of emotional tone. Hierarchical language modeling implicitly decouples semantic and acoustic features, ensuring high-fidelity cloning results.
Which languages does VoxCPM support?
VoxCPM primarily supports Chinese and English, having been trained on 1.8 million hours of bilingual corpus. It also supports cross-language voice cloning, so an English speaker's voice can be applied to Chinese speech generation and vice versa. For other languages, results may be less reliable than for Chinese and English.
Is VoxCPM open source, and how can I use it?
VoxCPM is released under the Apache 2.0 open source license with fully open code and weights. You can get the source code from the GitHub repository, download pre-trained models from HuggingFace, and deploy by following the Quick Start guide above. Anyone can study, build on, or adapt this technology.