SoulX-Podcast – A Multi-Speaker Speech Synthesis Model Developed by Soul AI Lab
What is SoulX-Podcast?
SoulX-Podcast is a multi-speaker text-to-speech (TTS) model developed by Soul AI Lab, designed specifically for generating long-form podcast-style conversations. With 1.7 billion parameters, the model supports Mandarin Chinese, English, and multiple Chinese dialects such as Sichuanese, Henanese, and Cantonese. It also features cross-dialect prompting, allowing users to generate speech in a target dialect using prompts in Mandarin.
The model supports paralinguistic control (e.g., laughter, sighs, throat clearing), enhancing the naturalness and expressiveness of synthesized speech. SoulX-Podcast can produce over 90 minutes of coherent dialogue with stable timbre and emotional continuity, making it ideal for podcasts, audiobooks, and other long-form audio applications.

Key Features of SoulX-Podcast
- Multi-Speaker Support: Generates natural-sounding dialogues between multiple speakers, suitable for podcasts, audiobooks, and other conversational formats.
- Multilingual and Dialect Support: Supports Mandarin, English, and several Chinese dialects (e.g., Sichuanese, Henanese, Cantonese). With Dialect-Guided Prompting, users can input Mandarin prompts and synthesize speech in the desired dialect.
- Paralinguistic Control: Handles non-verbal cues such as laughter, sighs, and throat clearing, making synthesized voices sound more natural and expressive (a minimal script sketch follows this list).
- Long-Form Dialogue Generation: Produces continuous dialogue exceeding 90 minutes while maintaining consistent timbre and emotional flow, ideal for long podcasts and narrative audio.
- Zero-Shot Voice Cloning: Clones a target speaker's voice from only a short reference sample, with no speaker-specific training or fine-tuning required.
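To make the input side concrete, here is a minimal, purely illustrative sketch of how a two-speaker script with paralinguistic tokens might be assembled. The `[S1]`/`[S2]` speaker-tag syntax is an assumption for this sketch; only the `<|laughter|>` and `<|sigh|>` tokens appear in this overview, and the exact turn format is defined by the SoulX-Podcast repository.

```python
# Illustrative only: assembles a two-speaker podcast script with
# paralinguistic tokens. The [S1]/[S2] speaker-tag syntax is an
# assumption; check the SoulX-Podcast repository for the exact
# input format the model expects.

turns = [
    ("S1", "Welcome back to the show! <|laughter|> We have a lot to cover today."),
    ("S2", "Thanks for having me. <|sigh|> It has been a long week."),
    ("S1", "Let's start with your new project."),
]

# Join the turns into a single tagged transcript string.
script = "\n".join(f"[{speaker}] {text}" for speaker, text in turns)
print(script)
```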
Technical Foundations
- Base Model Architecture: Built on Qwen3-1.7B, a pretrained language model fine-tuned here for multi-speaker dialogue synthesis.
- Multi-Speaker Modeling: Uses speaker embeddings to distinguish speakers and switch between them naturally during generation.
- Cross-Dialect Generation: Employs Dialect-Guided Prompting (DGP) to generate dialectal speech from Mandarin prompts, enabling zero-shot dialect synthesis (illustrated in the first sketch after this list).
- Paralinguistic Control: Accepts special tokens such as `<|laughter|>` or `<|sigh|>` in the text input, allowing the model to embed the corresponding non-verbal expressions in the output speech.
- Long-Form Stability: Optimized attention mechanisms and decoder structures keep timbre and emotional continuity consistent across long conversations, preventing drift or inconsistency (see the second sketch after this list).
- Data Processing and Training: Trained on large-scale multi-speaker dialogue datasets, with a preprocessing pipeline covering speech enhancement, segmentation, speaker diarization, transcription, and quality filtering to ensure high fidelity.
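To illustrate Dialect-Guided Prompting, the sketch below prefixes a Mandarin prompt with a dialect control token so the same text can be rendered in a target dialect. The `<|Sichuanese|>` token name is an assumption for illustration; the actual dialect-tag vocabulary is defined by the model's tokenizer.

```python
# Illustrative only: tagging a Mandarin prompt for cross-dialect
# synthesis. <|Sichuanese|> is an assumed token name; the real
# dialect tags are defined by SoulX-Podcast's tokenizer.

mandarin_text = "今天天气真不错，我们出去走走吧。"  # "The weather is lovely today; let's go for a walk."
dialect_prompt = f"<|Sichuanese|> {mandarin_text}"
print(dialect_prompt)
```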
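The long-form stability point implies generation proceeds turn by turn while conditioning on recent context. Below is a minimal sketch of that rolling-context pattern; `synthesize_turn` is a hypothetical stand-in, not the project's actual API.

```python
# Sketch of turn-by-turn long-form generation with a rolling
# context window. synthesize_turn is a hypothetical placeholder
# for whatever inference call the SoulX-Podcast codebase exposes.
from collections import deque

def synthesize_turn(context, speaker, text):
    """Placeholder: a real implementation would return audio for
    `text` spoken by `speaker`, conditioned on `context` so timbre
    and emotion stay consistent across the episode."""
    return f"<audio:{speaker}:{text[:20]}...>"

CONTEXT_TURNS = 8  # keep only the most recent turns as conditioning
context = deque(maxlen=CONTEXT_TURNS)
episode_audio = []

script = [
    ("S1", "Welcome to episode twelve."),
    ("S2", "Glad to be back. <|laughter|>"),
    # ...hundreds more turns for a 90-minute episode
]

for speaker, text in script:
    audio = synthesize_turn(list(context), speaker, text)
    episode_audio.append(audio)
    context.append((speaker, text))  # slide the window forward

print(f"Generated {len(episode_audio)} turns")
```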
Project Resources
- Official Website: https://soul-ailab.github.io/soulx-podcast/
- GitHub Repository: https://github.com/Soul-AILab/SoulX-Podcast
- Hugging Face Collection: https://huggingface.co/collections/Soul-AILab/soulx-podcast
- arXiv Paper: https://arxiv.org/pdf/2510.23541
Application Scenarios
- Podcast Production: Generates over 90 minutes of coherent, natural dialogue, well suited to podcasts on technology, culture, entertainment, and more.
- Audiobook Narration: Produces expressive multi-character dialogue, enhancing storytelling for novels and other long-form content.
- Educational Content: Creates engaging multi-role dialogues for language learning, history narration, and other interactive educational materials.
- Entertainment and Gaming: Generates lifelike character voices for games, animations, and videos, enhancing immersion.
- Corporate Training: Simulates realistic dialogues for employee communication-skills and customer-service training programs.