MOSS-TTSD – A spoken dialogue speech generation model open-sourced by Tsinghua University lab


What is MOSS-TTSD?

MOSS-TTSD (Text to Spoken Dialogue) is an open-source spoken dialogue speech generation model developed by the Speech and Language Lab at Tsinghua University. It converts text-based dialogue scripts into natural, expressive conversational speech in both Chinese and English. Built on a semantic-acoustic neural audio codec and large-scale pre-trained language models, MOSS-TTSD was trained on over 1 million hours of single-speaker speech data and 400,000 hours of dialogue speech. It supports zero-shot voice cloning and can generate multi-speaker conversational audio with accurate speaker switching, making it well suited to applications such as AI podcasts, interviews, and news reports.

Key Features of MOSS-TTSD

  • Expressive Dialogue Speech Generation: Converts dialogue scripts into natural, expressive audio, accurately capturing prosody, intonation, and emotion.

  • Zero-Shot Multi-Speaker Voice Cloning: Accurately switches speakers in dialogue without requiring additional samples—supports realistic two-speaker interactions.

  • Bilingual Support: Capable of generating high-quality spoken dialogue in both Chinese and English.

  • Long-Form Audio Generation: Uses a low-bitrate codec and optimized training framework to generate audio up to 960 seconds long in a single pass, avoiding unnatural transitions from audio stitching.

  • Fully Open-Source and Commercial-Ready: Model weights, inference code, and API interfaces are fully open source and free for commercial use.
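To make the "dialogue script in, conversational speech out" workflow above concrete, here is a minimal sketch of what a two-speaker script might look like. The `[S1]`/`[S2]` speaker tags and the `synthesize` helper are illustrative assumptions for this example, not the model's documented interface:

```python
import re

# Hypothetical two-speaker dialogue script: one annotated string in
# which speaker tags mark the turn boundaries (assumed format).
script = (
    "[S1] Welcome back to the show. Today we're talking about "
    "open-source speech models. "
    "[S2] Thanks for having me. It's an exciting time for TTS."
)

def synthesize(script: str) -> list[str]:
    """Stand-in for a real model call: splits the script into
    alternating speaker tags and turn texts so the structure
    the model would consume is visible."""
    turns = re.split(r"(\[S\d\])", script)
    return [t.strip() for t in turns if t.strip()]

print(synthesize(script))
```

In a real pipeline, each tagged turn would be rendered in the corresponding cloned voice, with the model handling the speaker switches.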


Technical Principles of MOSS-TTSD

  • Core Architecture: MOSS-TTSD is fine-tuned from the Qwen3-1.7B-base model, using a discrete speech token modeling approach. It applies eight-layer Residual Vector Quantization (RVQ) to discretize speech into token sequences. These tokens are generated by an autoregressive model with delay patterns and then decoded back into waveform audio by the tokenizer's decoder.
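The delay-pattern arrangement mentioned above can be sketched as follows. The idea is to shift each RVQ layer right by its depth so the model predicts coarse codebooks before the finer ones that refine them; the padding value and exact shifting scheme here are illustrative assumptions, not MOSS-TTSD's actual implementation:

```python
import numpy as np

NUM_LAYERS = 8   # RVQ depth described in the article
PAD = -1         # placeholder id for shifted-out positions (assumed)

def apply_delay_pattern(tokens: np.ndarray) -> np.ndarray:
    """tokens: (num_layers, num_frames) RVQ token grid.
    Returns a (num_layers, num_frames + num_layers - 1) grid in which
    codebook layer k is delayed by k steps, so at generation step t the
    model emits layer 0 of frame t alongside deeper layers of earlier
    frames."""
    layers, frames = tokens.shape
    out = np.full((layers, frames + layers - 1), PAD, dtype=tokens.dtype)
    for k in range(layers):
        out[k, k:k + frames] = tokens[k]
    return out

grid = np.arange(NUM_LAYERS * 4).reshape(NUM_LAYERS, 4)  # toy 4-frame clip
delayed = apply_delay_pattern(grid)
print(delayed.shape)  # (8, 11)
```

This staggering lets a single autoregressive decoder model all eight codebook streams without waiting for a full frame's residual stack at each step.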

  • XY-Tokenizer for Speech Discretization: A key innovation is the XY-Tokenizer, a custom speech discretization encoder trained using two-stage multitask learning:

    • Stage 1: Trained on Automatic Speech Recognition (ASR) and reconstruction tasks so the encoder captures semantic content while preserving coarse acoustic features.

    • Stage 2: Freezes the encoder and quantization layers, then trains the decoder with reconstruction and GAN losses to recover fine-grained acoustic detail. XY-Tokenizer achieves superior performance at a 1 kbps bitrate and 12.5 Hz frame rate, outperforming comparable codecs.
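The quoted bitrate and frame rate pin down how much information each codebook index can carry. A quick back-of-the-envelope check (the resulting codebook size is our inference from the stated numbers, not a figure given in the text):

```python
# Stated in the article:
BITRATE_BPS = 1000      # 1 kbps
FRAME_RATE_HZ = 12.5    # frames per second
NUM_RVQ_LAYERS = 8      # RVQ depth

bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ          # 80 bits per frame
bits_per_layer = bits_per_frame / NUM_RVQ_LAYERS      # 10 bits per codebook
codebook_size = 2 ** int(bits_per_layer)              # 1024 entries (inferred)
print(bits_per_frame, bits_per_layer, codebook_size)
```

So at 1 kbps each 12.5 Hz frame budgets 10 bits per RVQ layer, consistent with codebooks on the order of 1024 entries.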

  • Data Processing & Pretraining: Trained on ~1 million hours of single-speaker speech and 400,000 hours of multi-speaker dialogue data, filtered and annotated through a high-efficiency data pipeline. Pretraining on 1.1 million hours of bilingual TTS data significantly enhanced prosody and expressiveness.

  • Long Audio Generation Capability: Thanks to ultra-low bitrate codecs, MOSS-TTSD can generate audio up to 960 seconds in length in a single pass, ensuring natural continuity without needing to stitch together audio segments.
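The 960-second single-pass claim also translates into a concrete sequence-length budget at the stated 12.5 Hz frame rate. A rough tally (illustrative arithmetic only, assuming all eight RVQ layers are modeled as discrete tokens):

```python
FRAME_RATE_HZ = 12.5    # codec frame rate from the article
NUM_RVQ_LAYERS = 8      # RVQ depth from the article
MAX_SECONDS = 960       # maximum single-pass duration

frames = int(MAX_SECONDS * FRAME_RATE_HZ)   # 12,000 codec frames
tokens = frames * NUM_RVQ_LAYERS            # 96,000 discrete tokens total
print(frames, tokens)
```

The low frame rate is what keeps a 16-minute clip within a tractable sequence length for the autoregressive backbone, which is why stitching shorter segments becomes unnecessary.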


Application Scenarios for MOSS-TTSD

  • AI Podcast Production: Ideal for creating natural-sounding conversational podcasts by simulating realistic multi-speaker dialogues.

  • Film and TV Dubbing: Supports expressive bilingual dialogue generation and zero-shot voice cloning, making it suitable for dubbing in audiovisual productions.

  • Long-Form Interviews: With the ability to generate up to 960 seconds of continuous speech, MOSS-TTSD is perfect for producing long-form interview audio with seamless transitions.

  • News Reporting: Generates engaging, dialogue-style speech for news narration, enhancing the listening experience.

  • E-Commerce Livestreaming: Enables digital humans to conduct conversational product promotions with natural-sounding voice generation to attract and engage viewers.
