MOSS-TTSD – A spoken dialogue speech generation model open-sourced by the OpenMOSS team at Fudan University
What is MOSS-TTSD?
MOSS-TTSD (Text to Spoken Dialogue) is an open-source spoken dialogue speech generation model developed by the OpenMOSS team at Fudan University. It converts text-based dialogue scripts into natural, expressive conversational speech in both Chinese and English. Built on a semantic-acoustic neural audio codec and a large-scale pre-trained language model, MOSS-TTSD was trained on over 1 million hours of single-speaker speech and 400,000 hours of dialogue speech. It supports zero-shot voice cloning and can generate multi-speaker conversational audio with accurate speaker switching, making it well suited to applications such as AI podcasts, interviews, and news reports.
Key Features of MOSS-TTSD
- Expressive Dialogue Speech Generation: Converts dialogue scripts into natural, expressive audio, accurately capturing prosody, intonation, and emotion.
- Zero-Shot Multi-Speaker Voice Cloning: Switches speakers accurately within a dialogue without requiring additional samples, supporting realistic two-speaker interactions.
- Bilingual Support: Generates high-quality spoken dialogue in both Chinese and English.
- Long-Form Audio Generation: Uses a low-bitrate codec and an optimized training framework to generate up to 960 seconds of audio in a single pass, avoiding the unnatural transitions caused by stitching segments together.
- Fully Open-Source and Commercial-Ready: Model weights, inference code, and the API are fully open source and free for commercial use.
Technical Principles of MOSS-TTSD
- Core Architecture: MOSS-TTSD is fine-tuned from the Qwen3-1.7B-base model and uses discrete speech token modeling. An eight-layer Residual Vector Quantization (RVQ) scheme discretizes speech into token sequences, which are generated by an autoregressive model with a delay pattern and decoded back into waveforms by the tokenizer's decoder.
- XY-Tokenizer for Speech Discretization: A key innovation is the XY-Tokenizer, a custom speech discretization encoder trained with two-stage multitask learning:
  - Stage 1: Trained on Automatic Speech Recognition (ASR) and reconstruction tasks so the encoder captures semantic information while preserving coarse acoustic features.
  - Stage 2: Freezes the encoder and quantization layers, then trains the decoder with reconstruction and GAN losses to recover fine-grained acoustic detail. XY-Tokenizer achieves strong performance at a 1 kbps bitrate and a 12.5 Hz frame rate, outperforming comparable codecs.
- Data Processing & Pretraining: Trained on roughly 1 million hours of single-speaker speech and 400,000 hours of multi-speaker dialogue, filtered and annotated through an efficient data pipeline. Pretraining on 1.1 million hours of bilingual TTS data significantly improves prosody and expressiveness.
- Long Audio Generation: Thanks to the ultra-low-bitrate codec, MOSS-TTSD can generate up to 960 seconds of audio in a single pass, ensuring natural continuity without stitching segments together.
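The residual quantization idea above can be sketched in a few lines of NumPy (a toy illustration, not MOSS-TTSD's actual codec; the random codebooks here are placeholders, whereas trained codebooks make the per-layer residual shrink): each layer quantizes the residual left over by the previous layers, and the discrete tokens are the chosen codeword indices.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Eight-layer RVQ in miniature: each codebook quantizes the
    residual left by the previous layers; the token sequence is the
    list of chosen codeword indices, one per layer."""
    residual = x.astype(float).copy()
    indices = []
    recon = np.zeros_like(residual)
    for cb in codebooks:                       # cb: (K, d) codeword matrix
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)                    # discrete token for this layer
        recon += cb[idx]
        residual -= cb[idx]                    # pass the residual onward
    return indices, recon

rng = np.random.default_rng(0)
dim, layers, codewords = 16, 8, 64             # toy sizes, not the model's
codebooks = [rng.normal(size=(codewords, dim)) for _ in range(layers)]
frame = rng.normal(size=dim)

tokens, approx = rvq_encode(frame, codebooks)
print(len(tokens))                             # 8 tokens per frame, one per layer
```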
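The "delay pattern" mentioned in the architecture bullet can be sketched as follows, under the common MusicGen-style convention (an assumption; MOSS-TTSD's exact offsets are not stated above): each RVQ layer k is shifted right by k steps, so the autoregressive model predicts layer k of a frame only after emitting that frame's lower layers.

```python
def apply_delay_pattern(codes, pad=0):
    """Shift RVQ layer k right by k steps; pad so all rows stay the
    same length. codes: list of layers, each a list of token ids."""
    n_layers = len(codes)
    out = []
    for k, layer in enumerate(codes):
        # k pads in front, (n_layers - 1 - k) behind
        out.append([pad] * k + layer + [pad] * (n_layers - 1 - k))
    return out

# Toy example: 3 layers, 4 frames.
codes = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
delayed = apply_delay_pattern(codes)
# delayed[0] = [1, 2, 3, 4, 0, 0]
# delayed[1] = [0, 5, 6, 7, 8, 0]
# delayed[2] = [0, 0, 9, 10, 11, 12]
```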
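The 1 kbps figure is consistent with the stated 12.5 Hz frame rate and eight RVQ layers if each codebook holds 1024 entries (10 bits per token; the codebook size is an assumption, not stated above), and the same numbers show how many tokens a single 960-second pass spans:

```python
frame_rate_hz = 12.5       # RVQ frames per second (stated above)
rvq_layers = 8             # eight-layer Residual Vector Quantization
bits_per_token = 10        # assumes 1024-entry codebooks (2**10)

bitrate_bps = frame_rate_hz * rvq_layers * bits_per_token
print(bitrate_bps)         # 1000.0 bits/s, i.e. ~1 kbps

frames_per_layer = frame_rate_hz * 960         # frames in a 960 s pass
total_tokens = frames_per_layer * rvq_layers   # discrete tokens overall
print(int(frames_per_layer), int(total_tokens))
```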
Project Links
- Official Website: https://www.open-moss.com/en/moss-ttsd/
- GitHub Repository: https://github.com/OpenMOSS/MOSS-TTSD
- Hugging Face Model: https://huggingface.co/fnlp/MOSS-TTSD-v0.5
- Live Demo: https://huggingface.co/spaces/fnlp/MOSS-TTSD
Application Scenarios for MOSS-TTSD
- AI Podcast Production: Creates natural-sounding conversational podcasts by simulating realistic multi-speaker dialogue.
- Film and TV Dubbing: Expressive bilingual dialogue generation and zero-shot voice cloning make it suitable for dubbing audiovisual productions.
- Long-Form Interviews: With up to 960 seconds of continuous speech per pass, MOSS-TTSD is well suited to producing long-form interview audio with seamless transitions.
- News Reporting: Generates engaging, dialogue-style speech for news narration, enhancing the listening experience.
- E-Commerce Livestreaming: Enables digital humans to deliver conversational product promotions with natural-sounding voices that attract and engage viewers.