SoulX-Podcast – A Multi-Speaker Speech Synthesis Model Developed by Soul

What is SoulX-Podcast?

SoulX-Podcast is a multi-speaker text-to-speech (TTS) model developed by Soul AI Lab, designed specifically for generating long-form podcast-style conversations. With 1.7 billion parameters, the model supports Mandarin Chinese, English, and multiple Chinese dialects such as Sichuanese, Henanese, and Cantonese. It also features cross-dialect prompting, allowing users to generate speech in a target dialect using prompts in Mandarin.

The model supports paralinguistic control (e.g., laughter, sighs, throat clearing), enhancing the naturalness and expressiveness of synthesized speech. SoulX-Podcast can produce over 90 minutes of coherent dialogue with stable timbre and emotional continuity, making it ideal for podcasts, audiobooks, and other long-form audio applications.

Key Features of SoulX-Podcast

  • Multi-Speaker Support:
    Generates natural-sounding dialogues between multiple speakers, suitable for podcasts, audiobooks, and other conversational formats (see the script sketch after this list).

  • Multilingual and Dialect Support:
    Supports Mandarin, English, and various Chinese dialects (e.g., Sichuanese, Henanese, Cantonese). With Dialect-Guided Prompting, users can input prompts in Mandarin to synthesize speech in the desired dialect.

  • Paralinguistic Control:
    Handles non-verbal cues such as laughter, sighs, and throat clearing to make synthesized voices sound more natural and expressive.

  • Long-Form Dialogue Generation:
    Capable of generating continuous dialogue exceeding 90 minutes while maintaining consistent timbre and emotional flow—ideal for long podcasts or narrative audio.

  • Zero-Shot Voice Cloning:
    Clones a target speaker's voice from only a short reference clip, with no fine-tuning or speaker-specific training required, enabling high-quality personalized voice generation.

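To make the turn-based input format concrete, here is a minimal sketch of how a two-speaker script with a paralinguistic cue might be serialized as tagged text. The <|S1|>/<|S2|> turn markers and the render_script helper are illustrative assumptions; only the <|laughter|> and <|sigh|> tokens are actually named in this article.

```python
# Minimal sketch: serializing a two-speaker script into tagged text.
# ASSUMPTION: the <|S1|>/<|S2|> turn-marker syntax is invented for this
# demo; <|laughter|> and <|sigh|> are the tokens this article names.

turns = [
    ("S1", "Welcome back to the show. Today we're talking about dialect TTS."),
    ("S2", "Glad to be here. <|laughter|> This topic is close to my heart."),
    ("S1", "Then let's start with cross-dialect prompting."),
]

def render_script(turns):
    """Serialize (speaker, text) pairs into one prompt string, one turn per line."""
    return "\n".join(f"<|{speaker}|> {text}" for speaker, text in turns)

print(render_script(turns))
```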

Technical Foundations

  • Base Model Architecture:
    Built on Qwen3-1.7B, a pretrained large language model backbone, and fine-tuned for multi-speaker dialogue synthesis.

  • Multi-Speaker Modeling:
    Utilizes speaker embeddings to distinguish speakers and switch between them naturally during generation (first sketch after this list).

  • Cross-Dialect Generation:
    Employs the Dialect-Guided Prompting (DGP) technique to generate dialectal speech from Mandarin prompts, enabling zero-shot dialect synthesis (second sketch after this list).

  • Paralinguistic Control:
    Incorporates special tokens such as <|laughter|> or <|sigh|> in the text input, allowing the model to render the corresponding non-verbal expressions in the output speech (third sketch after this list).

  • Long-Form Stability:
    Optimized attention mechanisms and decoder structures maintain consistent timbre and emotional continuity across long conversations, preventing gradual drift.

  • Data Processing and Training:
    Trained on large-scale multi-speaker dialogue data, with a preprocessing pipeline covering speech enhancement, segmentation, speaker diarization, transcription, and quality filtering to ensure high-fidelity training material.
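
As a toy illustration of the speaker-embedding idea (the first sketch referenced in the list above), the snippet below maps per-turn speaker IDs to learned conditioning vectors. The use of PyTorch's nn.Embedding and the chosen dimensions are assumptions for illustration; this article does not describe SoulX-Podcast's internal implementation.

```python
import torch
import torch.nn as nn

# Toy illustration of speaker embeddings: each speaker ID indexes a learned
# vector that conditions generation on speaker identity.
# ASSUMPTION: table size and embedding dimension are invented for the demo.

num_speakers, embed_dim = 4, 256
speaker_table = nn.Embedding(num_speakers, embed_dim)

# An alternating two-speaker dialogue: turn i is spoken by turn_speaker_ids[i].
turn_speaker_ids = torch.tensor([0, 1, 0, 1])
conditioning = speaker_table(turn_speaker_ids)
print(conditioning.shape)  # torch.Size([4, 256])
```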

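The second sketch shows the idea behind Dialect-Guided Prompting: a Mandarin reference prompt plus a dialect control tag steer generation toward the target dialect. The <|Sichuanese|> tag syntax and the build_dgp_input helper are assumptions; the article names the technique but not its concrete input format.

```python
# Conceptual sketch of Dialect-Guided Prompting (DGP).
# ASSUMPTION: the <|dialect|> tag syntax is invented for illustration; the
# article describes the technique (Mandarin prompt -> dialect speech) but
# not the exact token format.

def build_dgp_input(dialect: str, mandarin_prompt: str, target_text: str) -> str:
    """Compose the text input: dialect tag + Mandarin reference + target line."""
    return f"<|{dialect}|> {mandarin_prompt}\n{target_text}"

print(build_dgp_input(
    "Sichuanese",
    "今天天气真好。",        # Mandarin reference: "The weather is great today."
    "我们摆一哈龙门阵嘛。",  # target line to be rendered in Sichuanese
))
```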

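The third sketch is a small pre-flight check for paralinguistic markers in a script. <|laughter|> and <|sigh|> come from this article; <|throat_clearing|> is an assumed name for the throat-clearing cue it mentions.

```python
import re

# Pre-flight check for paralinguistic markers before synthesis.
# ASSUMPTION: <|throat_clearing|> is an invented token name; the article
# only spells out <|laughter|> and <|sigh|>.

SUPPORTED_TOKENS = {"laughter", "sigh", "throat_clearing"}

def unknown_tokens(text: str) -> list[str]:
    """Return every <|...|> marker in the text that is not a known token."""
    return [t for t in re.findall(r"<\|([a-z_]+)\|>", text)
            if t not in SUPPORTED_TOKENS]

line = "That's hilarious <|laughter|> ... wait <|cough|> let me continue."
print(unknown_tokens(line))  # ['cough']
```
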
Application Scenarios

  • Podcast Production:
    Generates over 90 minutes of coherent, natural dialogue—ideal for creating podcasts in domains such as technology, culture, and entertainment.

  • Audiobook Narration:
    Produces expressive multi-character dialogue, enhancing storytelling for novels and long-form content.

  • Educational Content:
    Creates engaging multi-role dialogues for language learning, history narration, and other interactive educational materials.

  • Entertainment and Gaming:
    Generates lifelike character voices for games, animations, and videos, enhancing immersion.

  • Corporate Training:
    Simulates realistic dialogues for employee communication skills and customer service training programs.
