Speech-02 – MiniMax’s New Generation Text-to-Speech Model

What is Speech-02?

Speech-02 is a next-generation text-to-speech (TTS) model developed by MiniMax. It is built on an autoregressive Transformer architecture and supports zero-shot voice cloning: given just a few seconds of reference audio, it can generate speech that closely matches the target voice. Its Flow-VAE component increases the model's capacity to represent information during speech generation, improving both audio quality and speaker similarity.

Speech-02 is available in two versions:

Speech-02-HD: Designed for high-fidelity applications such as dubbing and audiobooks. It aims to eliminate rhythm inconsistencies while keeping the audio clear and natural.
Speech-02-Turbo: Optimized for real-time performance, balancing ultra-low latency with excellent audio quality, suitable for interactive applications.

The Speech-02 model is available on the MiniMax Audio platform and the MiniMax API platform.


Key Features of Speech-02

Zero-Shot Voice Cloning: Reproduces a target voice closely from just a few seconds of reference audio, with no per-speaker training.
High-Quality Speech Synthesis: Produces natural and fluent speech across multiple languages and dialects.
Multilingual Support: Supports 32 languages, with strong performance in Chinese, English, Cantonese, and more. Enables seamless language switching.
Personalized Voice Generation: Learns from user-provided sample audio to produce custom, personalized voices.
Emotional Control: Supports various emotions (e.g., happiness, sadness), allowing users to guide voice generation based on textual descriptions.

Technical Principles of Speech-02

Autoregressive Transformer Architecture: The model uses an autoregressive Transformer to generate speech features step-by-step, resulting in more natural prosody, intonation, and coherence in synthesized speech.
Zero-Shot Voice Cloning: Employs a learnable speaker encoder to extract the most useful vocal features for voice synthesis, such as unique pronunciation habits. The model only needs a few seconds of reference audio to generate a highly similar target voice.
Flow-VAE Architecture: Applies reversible (invertible) mappings in the latent space so that the latent representation can capture more complex data patterns. This increases representational capacity during speech generation, improving both quality and speaker similarity.
T2V (Text-to-Voice) Framework: Combines open-ended natural language descriptions with structured label information to enable highly flexible and controllable timbre and emotion generation based on user input.
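
Two of the ideas above, step-by-step autoregressive generation and reversible latent mappings, can be illustrated with a small toy sketch. This is not MiniMax's implementation: the names generate_autoregressive, toy_step, coupling_forward, and coupling_inverse are hypothetical, the "decoder" is a stand-in arithmetic rule rather than a Transformer, and the reversible mapping is a generic affine coupling layer of the kind used in normalizing flows, which may differ from what Flow-VAE actually uses.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def generate_autoregressive(
    step_fn: Callable[[Sequence[Vector], Vector], Vector],
    speaker_embedding: Vector,
    num_frames: int,
) -> List[Vector]:
    """Generate frames one at a time; each step sees all earlier frames
    plus a fixed speaker embedding (assumed to come from a speaker encoder)."""
    frames: List[Vector] = []
    for _ in range(num_frames):
        frames.append(step_fn(frames, speaker_embedding))
    return frames


def toy_step(history: Sequence[Vector], speaker_embedding: Vector) -> Vector:
    """Hypothetical stand-in for a Transformer decoder step (not the real model)."""
    previous = history[-1] if history else [0.0] * len(speaker_embedding)
    return [0.5 * p + s for p, s in zip(previous, speaker_embedding)]


def _coupling_params(a: Vector, scale: float, shift: float) -> Tuple[Vector, Vector]:
    """Log-scale and translation terms, computed from the untouched half only."""
    return [math.tanh(scale * v) for v in a], [shift * v for v in a]


def coupling_forward(x: Vector, scale: float = 0.5, shift: float = 1.0) -> Vector:
    """Affine coupling: keep the first half, transform the second half."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    a, b = x[:half], x[half:]
    s, t = _coupling_params(a, scale, shift)
    return a + [bv * math.exp(sv) + tv for bv, sv, tv in zip(b, s, t)]


def coupling_inverse(y: Vector, scale: float = 0.5, shift: float = 1.0) -> Vector:
    """Exact inverse of coupling_forward; possible because the first half,
    which determines the parameters, passes through unchanged."""
    if len(y) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(y) // 2
    a, b = y[:half], y[half:]
    s, t = _coupling_params(a, scale, shift)
    return a + [(bv - tv) * math.exp(-sv) for bv, sv, tv in zip(b, s, t)]


if __name__ == "__main__":
    # Three toy "frames" conditioned on a made-up two-dimensional speaker vector.
    print(generate_autoregressive(toy_step, [1.0, 2.0], 3))
    # Round trip through the reversible mapping recovers the input
    # up to floating-point error.
    latent = [0.2, -1.0, 0.7, 3.0]
    print(coupling_inverse(coupling_forward(latent)))
```

The coupling layer shows why such mappings are called reversible: the parameters depend only on the half of the vector that is left unchanged, so the transformation can be undone exactly. In a real system the scale and shift terms would be produced by learned networks rather than fixed formulas.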

Project Links

Official Website: https://www.minimax.io/news/speech-02-series
Technical Report: https://huggingface.co/spaces/MiniMaxAI/MiniMax-Speech-Tech-Report

Application Scenarios for Speech-02

Smart Voice Assistants: Enables natural and fluent interactions for smart devices, improving user satisfaction.
Audiobooks and Voiceovers: Creates high-quality audiobooks, commercial voiceovers, and more.
Social Media and Entertainment: Delivers personalized voice generation for social media, livestreams, and singing/chatting apps to enhance interactivity and engagement.
Education and Children’s Toys: Enhances learning devices and toys with vivid and engaging voice synthesis for a fun educational experience.
Smart Hardware Integration: Can be embedded in smart speakers, in-car voice systems, and other devices to make them more responsive and easier to use.