
VoxCPM

Next-Generation Tokenizer-Free TTS
Zero-Shot Voice Cloning Technology

VoxCPM is built on the MiniCPM-4 architecture and uses hierarchical language modeling to achieve tokenizer-free, end-to-end speech synthesis. Trained on 1.8 million hours of bilingual Chinese-English corpus, it supports context-aware speech generation and zero-shot voice cloning: with just 3-10 seconds of reference audio, it can replicate a speaker's timbre, accent, and emotional tone. It runs efficiently on consumer GPUs, generating speech roughly 6 times faster than playback speed, infers intonation style from the text itself, and supports cross-language synthesis between Chinese and English.

Context-Aware
Zero-Shot Cloning
Tokenizer-Free
High Efficiency

Experience VoxCPM Now

Online demonstration of tokenizer-free TTS technology and zero-shot voice cloning capabilities


Usage Guide

Zero-Shot Voice Cloning

  • Upload 3-10 seconds of reference audio
  • Input target text
  • Generate speech in the target speaker's voice

Context-Aware Generation

  • No prompt audio required
  • Infers intonation from text content
  • Supports emotional expression and context adaptation

VoxCPM Technical Architecture

Deep dive into VoxCPM's core technical principles and innovative architectural design

Text Input → MiniCPM-4 + Hierarchical Language Modeling → FSQ Quantization → Local Diffusion → High-Quality Speech Output

Core Technical Innovations

MiniCPM-4 Backbone Network

Uses the edge-deployment-optimized MiniCPM-4 language model as its backbone, integrating text semantic understanding with speech feature extraction through hierarchical language modeling to enable context-aware speech generation.

Tokenizer-Free End-to-End Architecture

Drops the discrete tokenization preprocessing of traditional TTS and models speech directly in a continuous space. An end-to-end diffusion-autoregressive architecture converts text to speech without an intermediate token bottleneck, preserving natural speech fluency.

FSQ Quantization Technology

Uses Finite Scalar Quantization (FSQ) to encode speech features efficiently, significantly reducing computational and storage costs while preserving audio quality.
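
A rough PyTorch sketch of the FSQ idea (the levels=8 setting and tensor shapes below are illustrative assumptions, not VoxCPM's actual configuration):

import torch

def fsq_quantize(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
    """Round each latent dimension to one of `levels` evenly spaced values.
    The implicit codebook is the product of per-dimension levels, so no
    codebook needs to be learned or searched."""
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half       # squash into [-half, half]
    quantized = torch.round(bounded)     # snap to the integer grid
    # Straight-through estimator: rounded values in the forward pass,
    # gradients flow through the continuous values in the backward pass.
    return bounded + (quantized - bounded).detach()

codes = fsq_quantize(torch.randn(1, 64))  # e.g. a 64-dim speech latent frame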

Local Diffusion Transformer

Combines diffusion models with the Transformer architecture: a local diffusion mechanism produces high-quality speech while keeping inference efficient, reaching a real-time factor (RTF) of 0.17.
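
RTF is simply synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. A quick illustration (the 1.7-second timing is hypothetical):

generation_seconds = 1.7   # wall-clock time to synthesize (hypothetical)
audio_seconds = 10.0       # duration of the speech produced
rtf = generation_seconds / audio_seconds
print(rtf)                 # 0.17 -> well under real time
print(1 / rtf)             # ~5.9 -> roughly 6x faster than playback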

Zero-Shot Voice Cloning

From just 3-10 seconds of reference audio, it extracts the subtle characteristics of a speaker's timbre, accent, and emotional tone, enabling high-fidelity voice cloning.

Technical Performance Metrics

VoxCPM performs strongly on multiple key metrics from recognized international benchmarks

Real-Time Factor (RTF): 0.17 (6x faster than playback speed)
Character Error Rate (CER): 0.93% (Chinese speech recognition accuracy)
Voice Similarity: 77.2% (Chinese voice cloning similarity)
Word Error Rate (WER): 1.85% (English speech recognition accuracy)
Based on Seed-TTS-eval and other authoritative benchmark evaluation results
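
For reference, WER and CER are edit-distance error rates between a recognizer's transcript of the generated speech and the input text. A minimal sketch using the jiwer library (illustrative strings, not the Seed-TTS-eval tooling itself):

from jiwer import wer, cer  # pip install jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box"   # one wrong word, one wrong character
print(wer(reference, hypothesis))    # 0.25  -> 1 of 4 words substituted
print(cer(reference, hypothesis))    # ~0.05 -> 1 of 19 characters substituted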

Technical Comparison Matrix

Comprehensive performance comparison between VoxCPM and mainstream TTS models

VoxCPM

Tokenizer-free architecture

RTF: 0.17
CER: 0.93%
Similarity: 77.2%
Full feature support

Competitor Comparison

CosyVoice: RTF 0.25, WER 3.2%
F5-TTS: RTF 0.42, WER 4.1%
SparkTTS: RTF 0.31, WER 2.8%
VoxCPM leads in speed, accuracy, and feature completeness among similar products

Core Capability Audio Demonstrations

Experience VoxCPM's exceptional performance in cross-language cloning, emotional expression, and context-aware generation

Cross-Language - EN→CN

English speaker voice cloned to Chinese speech

Cross-Language - CN→EN

Chinese speaker voice cloned to English speech

Emotion - Happy

Expressive speech in a happy tone

Emotion - Sad

Expressive speech in a sad tone

Context-Aware - News

Intelligent news broadcasting style

Context-Aware - Story

Intelligent storytelling style

More audio samples available at Official Demo Page

Technical Deep Dive

Explore VoxCPM's technical details and access development resources and academic research

Academic Paper

Detailed technical principles, experimental results, and performance evaluation reports

Read Paper

Open Source Code

Complete source code, model weights, and training scripts

Visit Repository

Quick Start

Deploy and use the VoxCPM model in just a few simple steps

Get Started

Model Download

Pre-trained model weights ready for direct inference

Download Model

API Documentation

Detailed API interface descriptions and usage examples

View Documentation

Community Support

Join the developer community for technical support and discussions

Join Discussion

Technical Highlights

Open Source License: Apache 2.0
Model Parameters: 500M
Training Data: 1.8M hours
Recommended Hardware: RTX 4090

Application Scenarios

VoxCPM demonstrates exceptional performance across multiple domains, providing powerful support for innovative applications

Audiobook Production

Rapidly generate high-quality audiobooks with consistent voice

Language Learning

Personalized speech education with multilingual accent training

Content Creation

Professional voice solutions for video dubbing and podcast production

Accessibility Applications

Personalized reading experiences for visually impaired individuals

Quick Start

Get started with VoxCPM in just a few steps to deploy and experience tokenizer-free TTS technology

Step 1: Environment Setup

First clone the repository and install dependencies:

$ git clone https://github.com/OpenBMB/VoxCPM.git
$ cd VoxCPM
$ pip install -r requirements.txt
Step 2: Model Download

Download pre-trained model weights:

$ huggingface-cli download openbmb/VoxCPM --local-dir ./checkpoints/VoxCPM
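
If the CLI is not available, the same files can be fetched from Python with the huggingface_hub library (same repo id and target directory as the command above):

from huggingface_hub import snapshot_download

# Downloads all model files into the directory the quick-start code expects.
snapshot_download(repo_id="openbmb/VoxCPM", local_dir="./checkpoints/VoxCPM")
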
Step 3: Quick Usage

Use a Python script for speech synthesis:

from voxcpm import VoxCPM

# Initialize model
model = VoxCPM("./checkpoints/VoxCPM")

# Speech synthesis
audio = model.synthesize(
    text="Hello, this is VoxCPM speech synthesis demo",
    reference_audio="path/to/reference.wav"
)
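
To keep the result, write it to disk. A follow-up sketch, assuming audio is a waveform array and a 16 kHz output rate (check the model card for the actual sample rate and return type):

import soundfile as sf  # pip install soundfile

# `audio` comes from model.synthesize() above; 16000 Hz is an assumption.
sf.write("output.wav", audio, 16000)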

System Requirements

  • Python 3.8+
  • PyTorch 1.13.0+
  • CUDA 11.6+ (RTX 4090 or higher recommended)
  • RAM: 16GB+
  • VRAM: 12GB+
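
As shown in the list above, a CUDA-capable GPU is expected. A quick sanity check that your environment meets these requirements, using PyTorch's built-in introspection:

import torch

print("PyTorch:", torch.__version__)                  # expect 1.13.0+
print("CUDA available:", torch.cuda.is_available())   # expect True
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {vram_gb:.1f} GB")                  # expect 12 GB+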

Frequently Asked Questions

Common questions and answers about VoxCPM technology and usage

What is tokenizer-free TTS?

Tokenizer-free TTS abandons the traditional tokenization step and models directly in continuous speech space. The approach is akin to composing music in its original continuous form rather than first chopping it into MIDI notes and reassembling them, so it preserves the natural fluency and expressiveness of speech.

What hardware does VoxCPM need?

VoxCPM runs efficiently on consumer GPUs such as the RTX 4090, achieving a real-time factor of 0.17, that is, generating speech about 6 times faster than playback speed. At 500 million parameters, it is a lightweight model for this field, balancing performance and efficiency.

How does zero-shot voice cloning work?

With just 3-10 seconds of reference audio, VoxCPM extracts and replicates a speaker's voice characteristics, including accent and emotional-tone nuances. Hierarchical language modeling implicitly decouples semantic and acoustic features, which keeps the cloned voice high-fidelity.

Which languages are supported?

VoxCPM primarily supports Chinese and English, having been trained on 1.8 million hours of bilingual corpus. It also supports cross-language voice cloning, so an English speaker's voice can be applied to Chinese speech generation and vice versa. Results for other languages may fall short of Chinese and English quality.

How is VoxCPM licensed?

VoxCPM is released under the Apache 2.0 open-source license with fully open code and weights. You can get the source code from the GitHub repository, download pre-trained models from HuggingFace, and deploy by following the quick start guide. Anyone may study, build on, or adapt the technology.