SoulX-Podcast – A Multi-Speaker Speech Synthesis Model Developed by Soul
What is SoulX-Podcast?
SoulX-Podcast is a multi-speaker text-to-speech (TTS) model developed by Soul AI Lab, designed specifically for generating long-form podcast-style conversations. With 1.7 billion parameters, the model supports Mandarin Chinese, English, and multiple Chinese dialects such as Sichuanese, Henanese, and Cantonese. It also features cross-dialect prompting, allowing users to generate speech in a target dialect using prompts in Mandarin.
The model supports paralinguistic control (e.g., laughter, sighs, throat clearing), enhancing the naturalness and expressiveness of synthesized speech. SoulX-Podcast can produce over 90 minutes of coherent dialogue with stable timbre and emotional continuity, making it ideal for podcasts, audiobooks, and other long-form audio applications.

Key Features of SoulX-Podcast
- Multi-Speaker Support: Generates natural-sounding dialogues between multiple speakers, suitable for podcasts, audiobooks, and other conversational formats (a minimal input sketch follows this list).
- Multilingual and Dialect Support: Supports Mandarin, English, and various Chinese dialects (e.g., Sichuanese, Henanese, Cantonese). With Dialect-Guided Prompting, users can input prompts in Mandarin to synthesize speech in the desired dialect.
- Paralinguistic Control: Handles non-verbal cues such as laughter, sighs, and throat clearing, making synthesized voices sound more natural and expressive.
- Long-Form Dialogue Generation: Capable of generating continuous dialogue exceeding 90 minutes while maintaining consistent timbre and emotional flow, ideal for long podcasts or narrative audio.
- Zero-Shot Voice Cloning: Clones a speaker's voice from a short reference clip, with no speaker-specific training or fine-tuning required.
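
To make the input side concrete, here is a minimal sketch of how a two-speaker script with a paralinguistic cue might be assembled. The [S1]/[S2] speaker tags and the SoulXPodcast/generate names are illustrative assumptions, not the project's confirmed API; consult the GitHub repository listed below for the actual interface.

```python
# Illustrative sketch only: the speaker-tag format and the class/method names
# below are assumptions, not SoulX-Podcast's confirmed API. See the GitHub
# repository for the real interface and checkpoint names.
dialogue = "\n".join([
    "[S1] Welcome back to the show. Today we're talking about dialect TTS.",
    "[S2] Thanks for having me! <|laughter|> It's a fun topic.",
    "[S1] Let's start with how zero-shot voice cloning works.",
])

# Hypothetical inference call: one short reference clip per speaker is enough
# for zero-shot cloning; no fine-tuning on the target voices is needed.
# model = SoulXPodcast.from_pretrained("Soul-AILab/SoulX-Podcast-1.7B")
# audio = model.generate(dialogue, speaker_prompts={"S1": "host.wav", "S2": "guest.wav"})
```

In an interface of this style, long scripts are typically synthesized turn by turn, with earlier turns kept as context so that each speaker's timbre stays stable across the session.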
Technical Foundations
- Base Model Architecture: Built on Qwen3-1.7B, a pretrained language model backbone, fine-tuned for multi-speaker dialogue synthesis tasks.
- Multi-Speaker Modeling: Utilizes speaker embeddings to distinguish speakers and switch naturally between them during generation.
- Cross-Dialect Generation: Employs the Dialect-Guided Prompting (DGP) technique to generate dialectal speech from Mandarin prompts, supporting zero-shot dialect synthesis (a prompt sketch appears after this list).
- Paralinguistic Control: Incorporates special tokens such as <|laughter|> or <|sigh|> in the text input, allowing the model to embed the corresponding non-verbal expressions in the output speech.
- Long-Form Stability: Optimized attention mechanisms and decoder structures keep timbre and emotional delivery consistent across long conversations, preventing drift or inconsistency.
- Data Processing and Training: Trained on large-scale multi-speaker dialogue datasets, with preprocessing steps including speech enhancement, segmentation, speaker diarization, text transcription, and quality filtering to ensure high fidelity (a plausible pipeline decomposition is sketched after the prompt example below).
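
As a concrete illustration of Dialect-Guided Prompting, the sketch below prepends a dialect control tag to Mandarin text. The tag spellings (<|Sichuanese|> and so on) are assumptions made for illustration; the released model defines its own token inventory, documented in the repository and paper.

```python
# Sketch of Dialect-Guided Prompting (DGP): a dialect tag steers synthesis
# toward the target dialect while the prompt text itself stays in Mandarin.
# The tag spellings below are hypothetical, not the model's actual tokens.
DIALECT_TAGS = {
    "sichuanese": "<|Sichuanese|>",
    "henanese": "<|Henanese|>",
    "cantonese": "<|Cantonese|>",
}

def with_dialect(text: str, dialect: str) -> str:
    """Prefix Mandarin text with a control tag for the target dialect."""
    return f"{DIALECT_TAGS[dialect]} {text}"

print(with_dialect("今天我们聊聊播客的制作流程。", "cantonese"))
# -> <|Cantonese|> 今天我们聊聊播客的制作流程。
```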
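
The data-processing description above also maps naturally onto a staged pipeline. The decomposition below is one plausible reading of those steps; all function names are hypothetical stubs marking where real enhancement, diarization, and ASR tools would plug in, and they do not correspond to published SoulX-Podcast code.

```python
# One plausible decomposition of the described training-data pipeline.
# Every stage is a hypothetical stub; real tools would replace each body.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str      # segmented clip on disk
    speaker: str = ""    # diarization label, e.g., "S1"
    text: str = ""       # ASR transcript
    snr_db: float = 0.0  # quality score used for filtering

def enhance(raw_paths: list[str]) -> list[str]:
    return raw_paths  # denoise/dereverberate each recording here

def segment(clean_paths: list[str]) -> list[Utterance]:
    return [Utterance(audio_path=p) for p in clean_paths]  # VAD-based splitting

def diarize(utts: list[Utterance]) -> list[Utterance]:
    return utts  # attach a speaker label to each utterance here

def transcribe(utts: list[Utterance]) -> list[Utterance]:
    return utts  # run ASR and fill in .text here

def filter_quality(utts: list[Utterance], min_snr: float = 15.0) -> list[Utterance]:
    return [u for u in utts if u.snr_db >= min_snr]  # drop noisy/low-confidence clips

def build_corpus(raw_paths: list[str]) -> list[Utterance]:
    return filter_quality(transcribe(diarize(segment(enhance(raw_paths)))))
```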
Project Resources
- Official Website: https://soul-ailab.github.io/soulx-podcast/
- GitHub Repository: https://github.com/Soul-AILab/SoulX-Podcast
- Hugging Face Collection: https://huggingface.co/collections/Soul-AILab/soulx-podcast
- arXiv Paper: https://arxiv.org/pdf/2510.23541
Application Scenarios
- Podcast Production: Generates over 90 minutes of coherent, natural dialogue, ideal for creating podcasts in domains such as technology, culture, and entertainment.
- Audiobook Narration: Produces expressive multi-character dialogue, enhancing storytelling for novels and other long-form content.
- Educational Content: Creates engaging multi-role dialogues for language learning, history narration, and other interactive educational materials.
- Entertainment and Gaming: Generates lifelike character voices for games, animations, and videos, enhancing immersion.
- Corporate Training: Simulates realistic dialogues for employee communication-skills and customer-service training programs.