
VoxCPM
Next-Generation Tokenizer-Free TTS
Zero-Shot Voice Cloning Technology
VoxCPM is built on the MiniCPM-4 architecture and uses hierarchical language modeling to achieve tokenizer-free, end-to-end speech synthesis. Trained on 1.8 million hours of bilingual corpus, it supports context-aware speech generation and zero-shot voice cloning: with just 3-10 seconds of reference audio, it can replicate a speaker's timbre, accent, and emotional tone. It runs efficiently on consumer GPUs, generating speech 6 times faster than playback speed. VoxCPM also infers intonation style from the text content itself and supports cross-language synthesis between Chinese and English.
Experience VoxCPM Now
Online demonstration of tokenizer-free TTS technology and zero-shot voice cloning capabilities
Usage Guide
Zero-Shot Voice Cloning
- Upload 3-10 seconds of reference audio
- Input the target text
- Generate speech in the target speaker's voice
Context-Aware Generation
- No prompt audio required
- Infers intonation from text content
- Supports emotional expression and context adaptation
VoxCPM Technical Architecture
Deep dive into VoxCPM's core technical principles and innovative architectural design
Core Technical Innovations
MiniCPM-4 Backbone Network
Built on MiniCPM-4, a large language model optimized for edge deployment, as its core architecture. Hierarchical language modeling integrates text semantic understanding with speech feature extraction, enabling context-aware speech generation.
Tokenizer-Free End-to-End Architecture
Abandons the discrete speech tokenization used in traditional TTS pipelines and models directly in continuous speech space. An end-to-end diffusion-autoregressive architecture converts text to speech with high fidelity while preserving natural fluency.
FSQ Quantization Technology
Adopts Finite Scalar Quantization (FSQ) for efficient encoding of speech features, significantly reducing computational complexity and storage requirements while maintaining audio quality.
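To illustrate the core idea, here is a minimal FSQ sketch in PyTorch (illustrative only, not VoxCPM's actual codec; the level count is an assumption):
# Minimal sketch of Finite Scalar Quantization (FSQ); illustrative only,
# not VoxCPM's actual codec implementation.
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    half = (levels - 1) / 2
    z = torch.tanh(z) * half          # bound each dimension to (-half, half)
    z_q = torch.round(z)              # snap to the nearest of `levels` integer values
    return z + (z_q - z).detach()     # straight-through estimator keeps gradients flowing

# Each latent dimension now takes one of `levels` discrete values, so a
# codebook emerges without any learned vector-quantization lookup table.
codes = fsq_quantize(torch.randn(2, 4))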
Local Diffusion Transformer
Combines the strengths of diffusion models and the Transformer architecture: a local diffusion mechanism produces high-quality speech while keeping inference efficient, reaching a real-time factor (RTF) of 0.17.
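For reference, RTF is simply wall-clock generation time divided by the duration of the audio produced (a minimal sketch; the synthesize callable and the 16 kHz sample rate are assumptions):
# Real-time factor (RTF): generation time / duration of audio produced.
# RTF 0.17 means speech is generated about 6x faster than playback.
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 16000) -> float:
    start = time.perf_counter()
    audio = synthesize(text)                      # returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)   # < 1.0 means faster than real time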
Zero-Shot Voice Cloning
With just a small amount of reference audio (3-10 seconds), it can extract subtle features of the speaker's timbre, accent, and emotional tone, achieving high-fidelity voice cloning.
Technical Performance Metrics
On authoritative international benchmarks, VoxCPM excels across multiple key metrics
Technical Comparison Matrix
Comprehensive performance comparison between VoxCPM and mainstream TTS models
| Model | RTF | CER (%) | Similarity | Zero-Shot | Multilingual | Open Source |
|---|---|---|---|---|---|---|
| VoxCPM | 0.17 | 0.93 | 77.2% | ✓ | ✓ | ✓ |
| CosyVoice | 0.25 | 3.2 | 0.88 | | | |
| F5-TTS | 0.42 | 4.1 | 0.85 | | | |
| SparkTTS | 0.31 | 2.8 | 0.89 | | | |
[Chart: Competitor comparison of VoxCPM's tokenizer-free architecture against mainstream TTS models]
Core Capability Audio Demonstrations
Experience VoxCPM's exceptional performance in cross-language cloning, emotional expression, and context-aware generation
Cross-Language - EN→CN
English speaker voice cloned to Chinese speech
Cross-Language - CN→EN
Chinese speaker voice cloned to English speech
Emotion - Happy
Emotionally rich happy tone expression
Emotion - Sad
Emotionally rich sad tone expression
Context-Aware - News
Intelligent news broadcasting style
Context-Aware - Story
Intelligent storytelling style
Technical Deep Dive
Explore VoxCPM's technical details and access development resources and academic research
Academic Paper
Detailed technical principles, experimental results, and performance evaluation reports
Read Paper
Community Support
Join the developer community for technical support and discussions
Join Discussion
Technical Highlights
Application Scenarios
VoxCPM demonstrates exceptional performance across multiple domains, providing powerful support for innovative applications
Audiobook Production
Rapidly generate high-quality audiobooks with consistent voice
Language Learning
Personalized speech education with multilingual accent training
Content Creation
Professional voice solutions for video dubbing and podcast production
Accessibility Applications
Personalized reading experiences for visually impaired individuals
Quick Start
Get started with VoxCPM in just a few steps to deploy and experience tokenizer-free TTS technology
Environment Setup
First clone the repository and install dependencies (the GitHub URL reflects the OpenBMB release; adjust if your source differs):
$ git clone https://github.com/OpenBMB/VoxCPM.git
$ cd VoxCPM
$ pip install -r requirements.txt
Model Download
Download pre-trained model weights:
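For example, using the HuggingFace CLI (the model id openbmb/VoxCPM-0.5B is an assumption; substitute the published one):
$ huggingface-cli download openbmb/VoxCPM-0.5B --local-dir ./checkpoints/VoxCPM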
Quick Usage
Use a Python script for speech synthesis:
# Initialize the model (import path assumed from the package name)
from voxcpm import VoxCPM

model = VoxCPM("./checkpoints/VoxCPM")

# Speech synthesis: clone the voice from a short reference clip
audio = model.synthesize(
    text="Hello, this is a VoxCPM speech synthesis demo",
    reference_audio="path/to/reference.wav",
)
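The same call also covers context-aware generation without a reference speaker, and the output can be written to disk (a sketch assuming reference_audio is optional, as the usage guide above implies, and a 16 kHz output rate):
# Context-aware generation: intonation is inferred from the text alone
audio = model.synthesize(text="Breaking news: VoxCPM is now open source.")

# Save the result (sample rate assumed; check the model's actual output rate)
import soundfile as sf
sf.write("output.wav", audio, 16000)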
System Requirements
- Python 3.8+
- PyTorch 1.13.0+
- CUDA 11.6+ (RTX 4090 or higher recommended)
- RAM: 16 GB+
- VRAM: 12 GB+
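A quick way to verify the environment before deploying (a minimal check using PyTorch's standard CUDA utilities):
# Environment check: confirm PyTorch sees a CUDA GPU with enough VRAM
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 2**30:.1f} GiB")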
Frequently Asked Questions
Common questions and answers about VoxCPM technology and usage
What is tokenizer-free TTS technology?
Tokenizer-free TTS abandons the traditional step of chopping speech into discrete tokens and instead models directly in continuous speech space. The approach is similar to composing music in its original continuous space rather than reducing it to MIDI notes and rebuilding it, which preserves the natural fluency and expressiveness of speech.
How fast is VoxCPM, and what hardware does it need?
VoxCPM runs efficiently on consumer GPUs (such as the RTX 4090), achieving a real-time factor of 0.17, i.e. generating speech about 6 times faster than playback. With 500 million parameters, it is a lightweight design for this field, balancing performance and efficiency.
How does zero-shot voice cloning work?
With just 3-10 seconds of reference audio, VoxCPM extracts and replicates the speaker's voice characteristics, including accent and nuances of emotional tone. Hierarchical language modeling implicitly decouples semantic and acoustic features, ensuring high-fidelity cloning results.
Which languages does VoxCPM support?
VoxCPM primarily supports Chinese and English, having been trained on 1.8 million hours of bilingual corpus. It also supports cross-language voice cloning, so an English speaker's voice can be applied to Chinese speech generation and vice versa. For other languages, results may be less reliable than for Chinese and English.
Is VoxCPM open source, and how can I use it?
VoxCPM is released under the Apache 2.0 open source license with fully open code and weights. You can get the source code from the GitHub repository, download pre-trained models from HuggingFace, and deploy by following the Quick Start guide above. Anyone can study, build on, or adapt this technology.