FireRedTTS-2 – A streaming text-to-speech (TTS) system launched by Xiaohongshu


What is FireRedTTS-2?

FireRedTTS-2 is an advanced long-form streaming text-to-speech (TTS) system specializing in multi-speaker dialogue generation. It combines a 12.5 Hz streaming speech tokenizer with a dual-Transformer architecture to deliver low-latency, high-fidelity, multilingual speech synthesis. Supported languages include English, Chinese, Japanese, Korean, French, German, and Russian, with zero-shot cross-lingual and code-switching voice cloning. It currently generates dialogues of up to 3 minutes with up to 4 speakers, and both limits can be extended by adding more training data. It performs especially well in podcast generation and chatbot integration, producing stable, natural, context-aware, and emotionally expressive speech.
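
To see why the low frame rate matters, here is a back-of-envelope calculation of sequence length per codebook layer. The helper name is illustrative; the only figure taken from the article is the 12.5 Hz rate (a 50 Hz tokenizer is shown purely as a common baseline for contrast).

```python
# Tokens per codebook layer implied by a tokenizer's frame rate.
def token_count(frame_rate_hz: float, duration_s: float) -> int:
    """Number of speech tokens per layer for a clip of the given length."""
    return round(frame_rate_hz * duration_s)

three_minutes = 180  # seconds, the article's maximum dialogue length

print(token_count(12.5, three_minutes))  # 2250 tokens at 12.5 Hz
print(token_count(50.0, three_minutes))  # 9000 tokens at a 50 Hz baseline
```

A 4x shorter sequence makes text-to-token modeling more stable and long dialogues cheaper to generate, which is the trade-off the article attributes to the 12.5 Hz tokenizer.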


Main Features of FireRedTTS-2

  • Long-form dialogue generation: Supports 3-minute dialogues with up to 4 speakers. Dialogue length and number of speakers can be expanded with additional training data.

  • Multilingual support: Covers English, Chinese, Japanese, Korean, French, German, and Russian, with zero-shot cross-lingual and code-switching voice cloning capabilities.

  • Low latency & high fidelity: On an L20 GPU, first-packet latency is as low as 140 ms, making the system suitable for real-time interactive scenarios while maintaining high audio quality.

  • Stable voice output: In both monologue and dialogue tests, generated voices show high similarity to target speakers, with low speech recognition error rates, ensuring consistent quality and prosody.

  • Random timbre generation: Can generate voices with random characteristics, useful for building speech recognition training datasets or providing diverse test material for speech interaction systems.

  • Emotion and prosody generation: In chatbot integration, it can generate emotionally expressive speech according to context, enhancing interaction quality.

  • Real-time streaming generation: Uses a 12.5Hz streaming speech tokenizer to support high-fidelity streaming decoding, making it ideal for real-time applications.
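
The streaming behavior above can be sketched from the client's side: audio is consumed chunk by chunk, and what matters for interactivity is the time until the first chunk arrives. Everything here is a hypothetical stand-in (`synthesize_stream`, the chunk size, the PCM format); FireRedTTS-2's real API may look quite different.

```python
import time
from typing import Iterator

def synthesize_stream(text: str, chunk_ms: int = 80) -> Iterator[bytes]:
    """Stand-in generator: yields fixed-size PCM chunks as they are 'decoded'."""
    bytes_per_ms = 32  # 16 kHz mono, 16-bit PCM = 32 bytes per millisecond
    n_chunks = max(1, len(text) // 10)  # toy heuristic, not a real duration model
    for _ in range(n_chunks):
        yield b"\x00" * (bytes_per_ms * chunk_ms)

start = time.monotonic()
first_packet_ms = None
chunks = []
for chunk in synthesize_stream("Hello, streaming world."):
    if first_packet_ms is None:
        # Time-to-first-packet: the latency figure the article quotes (140 ms on an L20).
        first_packet_ms = (time.monotonic() - start) * 1000.0
    chunks.append(chunk)  # in a real client, play or forward the chunk here

print(f"first packet after {first_packet_ms:.2f} ms, {len(chunks)} chunks")
```

The point of the pattern is that playback can begin as soon as the first chunk lands, rather than waiting for the full utterance to be synthesized.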


Technical Principles of FireRedTTS-2

  • 12.5Hz streaming speech tokenizer: Operates at a low frame rate to encode richer semantic information, shorten speech sequences, stabilize text-to-token modeling, and support high-fidelity streaming decoding suitable for real-time scenarios.

  • Dual-Transformer architecture: Uses a text-speech interleaved format, aligning speaker-labeled text with speech tokens in chronological order. A large decoder-only Transformer predicts the first-layer tokens, while a smaller Transformer completes the subsequent layers.

  • Multilingual modeling: Through multilingual pretraining, it supports speech generation in multiple languages with zero-shot cross-lingual and code-switching voice cloning, adapting to different dialogue scenarios.

  • Low-latency design: The model architecture and inference pipeline are optimized so that first-packet latency is as low as 140 ms on an L20 GPU, meeting real-time interaction requirements.

  • Long dialogue support: Efficient tokenization and modeling mechanisms enable generation of 3-minute dialogues with 4 speakers, expandable with additional training data.

  • Context-aware prosody: Adjusts rhythm and emotion based on context, making speech output more natural and expressive.
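
The text-speech interleaved format described above can be sketched as follows: each turn contributes a speaker tag, that turn's text tokens, and then its first-layer speech tokens, in chronological order. Token names and the `[S1]`/`[S2]` tags are illustrative assumptions, not the model's actual vocabulary.

```python
# Sketch of a text-speech interleaved sequence for dialogue modeling.
def interleave(turns):
    """turns: list of (speaker_tag, text_tokens, speech_tokens) in order."""
    seq = []
    for speaker, text_toks, speech_toks in turns:
        seq.append(speaker)      # speaker label, e.g. "[S1]"
        seq.extend(text_toks)    # the text for this turn
        seq.extend(speech_toks)  # first-layer speech tokens for this turn
    return seq

dialogue = [
    ("[S1]", ["hi", "there"], ["a1", "a2", "a3"]),
    ("[S2]", ["hello"], ["b1", "b2"]),
]
print(interleave(dialogue))
# → ['[S1]', 'hi', 'there', 'a1', 'a2', 'a3', '[S2]', 'hello', 'b1', 'b2']
```

In the dual-Transformer setup, the large decoder-only model would predict this first-layer stream, while the smaller model fills in the remaining codebook layers for each frame.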


Project Links for FireRedTTS-2


Application Scenarios of FireRedTTS-2

  • Podcast generation: Can generate multi-speaker podcast content in multiple languages, providing stable and natural speech output for multilingual podcast production.

  • Chatbots: Can be integrated into chatbot frameworks to generate emotionally expressive speech based on context, enhancing interactive experiences.

  • Voice cloning: Supports zero-shot cross-lingual and code-switching voice cloning, producing speech with high similarity to target speakers.

  • Speech interaction systems: Useful for building speech interaction systems, offering diverse test voices through random timbre generation for different scenarios.

  • Speech recognition model training: Capable of generating speech with random features, reducing reliance on real recordings for training speech recognition systems.

  • Multilingual speech synthesis: Supports multiple languages, suitable for applications requiring multilingual speech, such as international conferences and multilingual customer service.
