SoulX-Podcast – A Multi-Speaker Speech Synthesis Model Developed by Soul

What is SoulX-Podcast?

SoulX-Podcast is a multi-speaker text-to-speech (TTS) model developed by Soul AI Lab, designed specifically for generating long-form podcast-style conversations. With 1.7 billion parameters, the model supports Mandarin Chinese, English, and multiple Chinese dialects such as Sichuanese, Henanese, and Cantonese. It also features cross-dialect prompting, allowing users to generate speech in a target dialect using prompts in Mandarin.

The model supports paralinguistic control (e.g., laughter, sighs, throat clearing), enhancing the naturalness and expressiveness of synthesized speech. SoulX-Podcast can produce over 90 minutes of coherent dialogue with stable timbre and emotional continuity, making it ideal for podcasts, audiobooks, and other long-form audio applications.

Key Features of SoulX-Podcast

  • Multi-Speaker Support:
    Generates natural-sounding dialogues between multiple speakers, suitable for podcasts, audiobooks, and other conversational formats (see the script sketch after this list).

  • Multilingual and Dialect Support:
    Supports Mandarin, English, and various Chinese dialects (e.g., Sichuanese, Henanese, Cantonese). With Dialect-Guided Prompting, users can input prompts in Mandarin to synthesize speech in the desired dialect.

  • Paralinguistic Control:
    Handles non-verbal cues such as laughter, sighs, and throat clearing to make synthesized voices sound more natural and expressive.

  • Long-Form Dialogue Generation:
    Capable of generating continuous dialogue exceeding 90 minutes while maintaining consistent timbre and emotional flow—ideal for long podcasts or narrative audio.

  • Zero-Shot Voice Cloning:
    Clones a target speaker's voice from only a short reference clip, with no fine-tuning or speaker-specific training required, enabling high-quality personalized voice generation.

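To make the turn-based input format concrete, here is a minimal sketch of how a two-speaker script with a paralinguistic cue might be serialized as tagged text. The <|S1|>/<|S2|> turn markers and the render_script helper are illustrative assumptions; only the <|laughter|> and <|sigh|> tokens are actually named in this article.

```python
# Minimal sketch: serializing a two-speaker script into tagged text.
# ASSUMPTION: the <|S1|>/<|S2|> turn-marker syntax is invented for this
# demo; <|laughter|> and <|sigh|> are the tokens this article names.

turns = [
    ("S1", "Welcome back to the show. Today we're talking about dialect TTS."),
    ("S2", "Glad to be here. <|laughter|> This topic is close to my heart."),
    ("S1", "Then let's start with cross-dialect prompting."),
]

def render_script(turns):
    """Serialize (speaker, text) pairs into one prompt string, one turn per line."""
    return "\n".join(f"<|{speaker}|> {text}" for speaker, text in turns)

print(render_script(turns))
```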

Technical Foundations

  • Base Model Architecture:
    Built on Qwen3-1.7B, a pretrained large language model backbone, and fine-tuned for multi-speaker dialogue synthesis.

  • Multi-Speaker Modeling:
    Utilizes speaker embeddings to distinguish speakers and switch between them naturally during generation (first sketch after this list).

  • Cross-Dialect Generation:
    Employs the Dialect-Guided Prompting (DGP) technique to generate dialectal speech from Mandarin prompts, enabling zero-shot dialect synthesis (second sketch after this list).

  • Paralinguistic Control:
    Incorporates special tokens such as <|laughter|> or <|sigh|> in the text input, allowing the model to render the corresponding non-verbal expressions in the output speech (third sketch after this list).

  • Long-Form Stability:
    Optimized attention mechanisms and decoder structures maintain consistent timbre and emotional continuity across long conversations, preventing gradual drift.

  • Data Processing and Training:
    Trained on large-scale multi-speaker dialogue data, with a preprocessing pipeline covering speech enhancement, segmentation, speaker diarization, transcription, and quality filtering to ensure high-fidelity training material.
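
As a toy illustration of the speaker-embedding idea (the first sketch referenced in the list above), the snippet below maps per-turn speaker IDs to learned conditioning vectors. The use of PyTorch's nn.Embedding and the chosen dimensions are assumptions for illustration; this article does not describe SoulX-Podcast's internal implementation.

```python
import torch
import torch.nn as nn

# Toy illustration of speaker embeddings: each speaker ID indexes a learned
# vector that conditions generation on speaker identity.
# ASSUMPTION: table size and embedding dimension are invented for the demo.

num_speakers, embed_dim = 4, 256
speaker_table = nn.Embedding(num_speakers, embed_dim)

# An alternating two-speaker dialogue: turn i is spoken by turn_speaker_ids[i].
turn_speaker_ids = torch.tensor([0, 1, 0, 1])
conditioning = speaker_table(turn_speaker_ids)
print(conditioning.shape)  # torch.Size([4, 256])
```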

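The second sketch shows the idea behind Dialect-Guided Prompting: a Mandarin reference prompt plus a dialect control tag steer generation toward the target dialect. The <|Sichuanese|> tag syntax and the build_dgp_input helper are assumptions; the article names the technique but not its concrete input format.

```python
# Conceptual sketch of Dialect-Guided Prompting (DGP).
# ASSUMPTION: the <|dialect|> tag syntax is invented for illustration; the
# article describes the technique (Mandarin prompt -> dialect speech) but
# not the exact token format.

def build_dgp_input(dialect: str, mandarin_prompt: str, target_text: str) -> str:
    """Compose the text input: dialect tag + Mandarin reference + target line."""
    return f"<|{dialect}|> {mandarin_prompt}\n{target_text}"

print(build_dgp_input(
    "Sichuanese",
    "今天天气真好。",        # Mandarin reference: "The weather is great today."
    "我们摆一哈龙门阵嘛。",  # target line to be rendered in Sichuanese
))
```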

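The third sketch is a small pre-flight check for paralinguistic markers in a script. <|laughter|> and <|sigh|> come from this article; <|throat_clearing|> is an assumed name for the throat-clearing cue it mentions.

```python
import re

# Pre-flight check for paralinguistic markers before synthesis.
# ASSUMPTION: <|throat_clearing|> is an invented token name; the article
# only spells out <|laughter|> and <|sigh|>.

SUPPORTED_TOKENS = {"laughter", "sigh", "throat_clearing"}

def unknown_tokens(text: str) -> list[str]:
    """Return every <|...|> marker in the text that is not a known token."""
    return [t for t in re.findall(r"<\|([a-z_]+)\|>", text)
            if t not in SUPPORTED_TOKENS]

line = "That's hilarious <|laughter|> ... wait <|cough|> let me continue."
print(unknown_tokens(line))  # ['cough']
```
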
Application Scenarios

  • Podcast Production:
    Generates over 90 minutes of coherent, natural dialogue—ideal for creating podcasts in domains such as technology, culture, and entertainment.

  • Audiobook Narration:
    Produces expressive multi-character dialogue, enhancing storytelling for novels and long-form content.

  • Educational Content:
    Creates engaging multi-role dialogues for language learning, history narration, and other interactive educational materials.

  • Entertainment and Gaming:
    Generates lifelike character voices for games, animations, and videos, enhancing immersion.

  • Corporate Training:
    Simulates realistic dialogues for employee communication skills and customer service training programs.
