ZipVoice – A Zero-Shot Speech Synthesis Model Released by Xiaomi
What is ZipVoice?
ZipVoice is an efficient zero-shot text-to-speech (TTS) model released by Xiaomi AI Lab. Built on a flow-matching architecture, it comes in two versions: ZipVoice (single-speaker speech) and ZipVoice-Dialog (two-speaker dialogue speech). Through innovations such as Zipformer-based efficient modeling, an average upsampling strategy, and flow distillation, the model stays lightweight and infers quickly, addressing the large parameter counts and slow inference of existing zero-shot TTS models. ZipVoice-Dialog adds speaker-turn embeddings and curriculum learning to deliver fast, stable, and natural dialogue synthesis.
Key Features of ZipVoice
- Zero-Shot Speech Synthesis: Generates speech in a target timbre from input text plus a short reference recording, with no need for large amounts of target-speaker training data.
- Fast Inference: Flow distillation sharply reduces the number of inference steps, enabling fast synthesis and efficient operation even on low-resource devices (see the sampling sketch after this list).
- High-Quality Speech Generation: Produces natural, high-quality speech with strong speaker similarity while keeping inference fast.
- Dialogue Speech Synthesis: The ZipVoice-Dialog version generates two-speaker dialogue with natural, accurate speaker switching, well suited to applications such as AI podcasts.
- Open-Source & Extensible: Model files, training code, inference code, and the OpenDialog dataset are all open-sourced, so developers can study and extend them.
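To make "fast inference" concrete: a flow-matching model generates speech by integrating a learned velocity field from noise to data, and distillation lets that integration run in just a few Euler steps. The sketch below is a minimal illustration of this few-step sampling regime; the function, its signature, and the dummy model are assumptions for illustration, not ZipVoice's actual API.

```python
import torch

@torch.no_grad()
def sample_flow(model, cond, num_frames, dim, num_steps: int = 4):
    """Few-step Euler sampling from a flow-matching acoustic model.
    A distilled model like ZipVoice can use very few such steps.
    (Illustrative names and signature; not the repository's API.)"""
    x = torch.randn(num_frames, dim)               # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1)   # integration time grid
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = model(x, t0, cond)                     # predicted velocity field
        x = x + (t1 - t0) * v                      # one Euler step along the flow
    return x  # mel-like acoustic features; a vocoder would render the waveform

# Toy usage with a dummy "model" that simply drifts toward the condition.
dummy = lambda x, t, cond: cond - x
mel = sample_flow(dummy, cond=torch.zeros(100, 80), num_frames=100, dim=80)
```

Each step is a full forward pass through the acoustic model, which is why cutting the step count dominates the speedup.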
Technical Principles of ZipVoice
- Efficient Modeling with Zipformer: Introduces the Zipformer architecture to TTS tasks for the first time, combining multi-scale structure, convolution-attention synergy, and attention-weight reuse to model speech efficiently with fewer parameters.
- Average Upsampling Strategy: Assumes each text token has equal duration and average-upsamples the token sequence to frame rate before it enters the prediction model, giving a stable initial alignment signal and improving training stability and convergence (first sketch after this list).
- Flow Distillation Acceleration: Distills a pre-trained teacher, whose targets incorporate classifier-free guidance (CFG), into a student that approximates the guided prediction in a single step, cutting inference steps and removing the extra CFG forward pass (second sketch after this list).
- Speaker-Turn Embeddings: Provide fine-grained speaker-identity cues during dialogue synthesis, making speaker switches easier to model and more accurate (third sketch after this list).
- Curriculum Learning: Pre-trains on single-speaker data to strengthen text-speech alignment, then fine-tunes on dialogue data to learn turn-taking and conversational style, taming the harder alignment problem in dialogues.
- Stereo Expansion: Using weight initialization, mono-speech regularization, and a speaker mutual-exclusion loss, ZipVoice-Dialog extends to stereo output with each speaker on their own channel, for more immersive dialogue audio (fourth sketch after this list).
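The average upsampling strategy is simple enough to show directly. This sketch distributes the target frame count as evenly as possible across text tokens and repeats each token embedding accordingly; function and variable names are mine, not the repository's.

```python
import torch

def average_upsample(token_emb: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Upsample token embeddings to frame rate, assuming each token covers
    an equal share of the total duration (ZipVoice's initial alignment
    prior). Illustrative names, not the repository's API."""
    num_tokens = token_emb.size(0)
    # Distribute frames as evenly as possible across tokens.
    base = num_frames // num_tokens
    rem = num_frames % num_tokens
    repeats = torch.full((num_tokens,), base, dtype=torch.long)
    repeats[:rem] += 1  # the first `rem` tokens get one extra frame
    return token_emb.repeat_interleave(repeats, dim=0)

tokens = torch.randn(12, 256)           # 12 text tokens, 256-dim embeddings
frames = average_upsample(tokens, 100)  # -> (100, 256) frame-level sequence
```

Because every token starts with the same duration, the flow-matching model sees a consistent rough alignment from the first training step, rather than having to discover one from scratch.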
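A minimal sketch of the flow-distillation idea: the frozen teacher's classifier-free-guided velocity becomes the regression target for the student, so at inference the student needs neither many steps nor a second unconditional pass. The callable signatures below are assumptions for illustration, not ZipVoice's actual interfaces.

```python
import torch

def cfg_velocity(teacher, x_t, t, cond, guidance_scale: float):
    """Classifier-free-guided velocity from a frozen teacher flow model.
    `teacher(x_t, t, cond)` is a placeholder signature, not the repo's API."""
    v_cond = teacher(x_t, t, cond)    # conditional prediction
    v_uncond = teacher(x_t, t, None)  # unconditional prediction (condition dropped)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def flow_distill_loss(student, teacher, x_t, t, cond, guidance_scale=2.0):
    """Train the student to reproduce the guided teacher in one forward pass."""
    with torch.no_grad():
        target = cfg_velocity(teacher, x_t, t, cond, guidance_scale)
    return torch.nn.functional.mse_loss(student(x_t, t, cond), target)
```

Note that the teacher runs two forward passes per target (conditional and unconditional), but that cost is paid once at distillation time; the student only ever runs the single conditional pass.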
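Speaker-turn embeddings can be pictured as a learned per-speaker vector added to each token's embedding according to who utters it. The module below is an illustrative reading of that idea, not code from the repository.

```python
import torch
import torch.nn as nn

class SpeakerTurnEmbedding(nn.Module):
    """Add a learned per-speaker vector to every token embedding, based on
    which of the two dialogue speakers utters it (illustrative module)."""
    def __init__(self, dim: int, num_speakers: int = 2):
        super().__init__()
        self.turn_emb = nn.Embedding(num_speakers, dim)

    def forward(self, token_emb: torch.Tensor, speaker_ids: torch.Tensor):
        # token_emb: (seq, dim); speaker_ids: (seq,) with values in {0, 1}
        return token_emb + self.turn_emb(speaker_ids)

layer = SpeakerTurnEmbedding(dim=256)
tokens = torch.randn(20, 256)
ids = torch.tensor([0] * 8 + [1] * 12)  # first 8 tokens speaker A, rest B
out = layer(tokens, ids)                 # (20, 256)
```

Marking every token with its speaker, rather than only marking turn boundaries, gives the model a continuous identity signal to condition on at each position.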
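For the stereo expansion, one plausible form of a speaker mutual-exclusion loss is to penalize frames where both output channels carry energy at once, pushing each speaker onto their own channel. This is a hedged sketch of that reading; the paper's exact formulation may differ.

```python
import torch

def mutual_exclusion_loss(left: torch.Tensor, right: torch.Tensor,
                          frame: int = 256) -> torch.Tensor:
    """Penalize simultaneous energy on both stereo channels.
    `left` and `right` are 1-D waveforms at least `frame` samples long.
    (One plausible formulation, not necessarily the paper's.)"""
    def frame_energy(wav: torch.Tensor) -> torch.Tensor:
        n = wav.numel() // frame * frame        # trim to a whole number of frames
        return wav[:n].reshape(-1, frame).pow(2).mean(dim=1)
    # High only when the same frame is loud on both channels.
    return (frame_energy(left) * frame_energy(right)).mean()
```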
Project Links
- GitHub Repository: https://github.com/k2-fsa/ZipVoice
- HuggingFace Model Hub: https://huggingface.co/k2-fsa/ZipVoice
- arXiv Paper: https://arxiv.org/pdf/2506.13053
Application Scenarios of ZipVoice
- Personal Assistants: For voice assistants on smartphones, smart speakers, and other devices, providing more natural and personalized speech interaction.
- In-Car Voice Systems: Enhancing navigation and voice control with smoother speech interaction in vehicles.
- Audiobooks: Converting novels, news, and articles into high-quality spoken audio.
- Video Dubbing: Automatically generating voiceovers for video content, reducing manual dubbing costs and improving content-creation efficiency.
- Language Learning: Offering standard speech demonstrations through TTS to help learners practice pronunciation.