MoonCast – A zero-shot AI podcast generation system that synthesizes natural podcast-style content

What is MoonCast?

MoonCast is a zero-shot podcast generation system that synthesizes natural podcast-style speech directly from plain text. It builds on long-context language modeling and is trained on large-scale audio datasets, so it can produce several minutes of podcast audio in either Chinese or English while keeping the speech natural and coherent. Generation runs in two steps: specially designed LLM prompts first turn the input text into a podcast script, and a speech synthesis module then converts that script into the final audio. Users can generate podcasts with a few simple commands and the pre-trained weights.
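
The two-step flow described above (prompt an LLM for a script, then hand the script to a speech synthesis module) can be sketched roughly as follows. This is a minimal illustration, not MoonCast's actual code: the function names, the Script type, and both prompt templates are hypothetical placeholders, and the LLM and the synthesizer are passed in as plain callables so the sketch runs without any model.

```python
from typing import Callable, Dict, List, Tuple

# A script is a list of (speaker, utterance) turns. Hypothetical type.
Script = List[Tuple[str, str]]

# Placeholder templates, NOT the prompts MoonCast ships with.
PROMPT_TEMPLATES: Dict[str, str] = {
    "en": "Rewrite the following text as a two-host podcast dialogue:\n{text}",
    "zh": "请将以下文本改写为双人播客对话：\n{text}",
}


def make_podcast(
    text: str,
    language: str,
    generate_script: Callable[[str], Script],
    synthesize: Callable[[Script], bytes],
) -> bytes:
    """Step 1: text -> script via an LLM prompt. Step 2: script -> audio."""
    if language not in PROMPT_TEMPLATES:
        raise ValueError(f"unsupported language: {language!r}")
    prompt = PROMPT_TEMPLATES[language].format(text=text)
    script = generate_script(prompt)
    return synthesize(script)


# Stand-ins so the sketch runs end to end without a real LLM or TTS model.
def fake_llm(prompt: str) -> Script:
    return [("Host A", prompt.splitlines()[-1])]


def fake_tts(script: Script) -> bytes:
    return b"".join(utterance.encode("utf-8") for _, utterance in script)
```

Keeping the two steps behind separate callables mirrors the description: the script generator and the speech module can be swapped or tested independently.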

Key Features of MoonCast

Long-Form Audio Generation
Leverages long-context language models and large-scale long-form speech data to generate multi-minute podcast-style audio with high coherence and fluidity.
Enhanced Naturalness
MoonCast’s podcast generation module adds natural conversational details to the script, which are essential for realistic podcast speech. The authors report that, in their experiments, it significantly outperforms existing baselines in naturalness and coherence.
Multilingual Support
Supports podcast generation in both Chinese and English using language-specific prompts to create the script.
Zero-Shot Voice Synthesis
Produces realistic speech based on just a few seconds of reference audio, maintaining high voice quality and speaker similarity throughout long-form content.

Technical Foundations of MoonCast

Multi-Stage Training
MoonCast’s training pipeline consists of three stages:
1. Stage One: The model learns to generate short utterances and single-speaker speech, building its zero-shot synthesis capability.
2. Stage Two: The model handles non-conversational long-form content (e.g., audiobooks) to improve long-context generation stability.
3. Stage Three: The model learns to generate long dialogues rich in conversational details, which is the setting that podcast generation requires.
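
One way to picture the three stages is as an ordered curriculum in which each stage starts from the result of the previous one. The sketch below only encodes that ordering; the Stage fields, run_curriculum, and the fake_train placeholder are illustrative assumptions, and no real training happens.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    data: str  # kind of training data, paraphrased from the list above
    goal: str  # capability the stage is meant to build


# Order follows the three-stage description; the field values are summaries.
CURRICULUM: List[Stage] = [
    Stage("stage_one", "short single-speaker utterances", "zero-shot synthesis"),
    Stage("stage_two", "non-conversational long-form audio", "long-context stability"),
    Stage("stage_three", "long multi-speaker dialogues", "conversational detail"),
]


def run_curriculum(stages: List[Stage], train_fn: Callable) -> Optional[list]:
    """Run stages in order, feeding each stage's result into the next."""
    state = None
    for stage in stages:
        state = train_fn(stage, state)
    return state


def fake_train(stage: Stage, state: Optional[list]) -> list:
    # Placeholder: records the visited stage instead of updating a model.
    return (state or []) + [stage.name]
```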
Segment-Level Autoregressive Audio Reconstruction
MoonCast reconstructs audio segment by segment in an autoregressive manner: each segment is reconstructed conditioned on the audio already generated. This allows output to be streamed and improves coherence and flow across segment boundaries.
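
The control flow of segment-level autoregressive reconstruction can be illustrated with a toy example. Everything here is an assumption made for illustration: the token and sample types, the function names, and especially toy_decoder, which is a simple smoothing filter standing in for a real audio decoder. The point is only that each segment is decoded with access to everything produced so far and is yielded as soon as it is ready.

```python
from typing import Callable, Iterator, List, Sequence


def split_segments(tokens: Sequence[int], segment_len: int) -> List[List[int]]:
    """Cut the token sequence into consecutive fixed-size segments."""
    return [list(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def reconstruct_streaming(
    tokens: Sequence[int],
    segment_len: int,
    decode_segment: Callable[[List[int], List[float]], List[float]],
) -> Iterator[List[float]]:
    """Yield audio one segment at a time, conditioning on earlier output."""
    history: List[float] = []
    for segment in split_segments(tokens, segment_len):
        audio = decode_segment(segment, history)
        history.extend(audio)  # later segments can see this output
        yield audio


def toy_decoder(segment: List[int], history: List[float]) -> List[float]:
    # Toy stand-in for a decoder: a smoothing filter that starts from the
    # last sample generated so far, so segment boundaries stay continuous.
    prev = history[-1] if history else 0.0
    out = []
    for token in segment:
        prev = 0.5 * prev + 0.5 * token
        out.append(prev)
    return out
```

In this toy setup, decoding in segments of two gives the same samples as decoding the whole sequence at once, because each segment picks up where the previous one ended.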
Spontaneity Enhancement
To boost the natural feel of podcasts, MoonCast injects spontaneous details into scripts, such as filler words, response words, and occasional stutters, making conversations sound more lifelike and unscripted.
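
To make the idea concrete, here is a small rule-based sketch that injects the three kinds of detail mentioned above into a script. It is a deliberate simplification and not how MoonCast does it: the description above attributes script generation to LLM prompts, whereas this stand-in uses random insertion. The word lists, probabilities, and function names are all hypothetical.

```python
import random
from typing import List, Tuple

FILLERS = ["um", "uh", "you know"]         # filler words
RESPONSES = ["Right.", "Mm-hm.", "Yeah."]  # short reactions to the other speaker


def stutter(text: str) -> str:
    """Repeat the first word once, e.g. 'I think' -> 'I- I think'."""
    first, _, rest = text.partition(" ")
    return f"{first}- {first} {rest}" if rest else text


def add_spontaneity(
    turns: List[Tuple[str, str]],
    rng: random.Random,
    filler_prob: float = 0.3,
    response_prob: float = 0.3,
    stutter_prob: float = 0.1,
) -> List[Tuple[str, str]]:
    """Return a copy of the script with randomly injected spoken details."""
    out = []
    for i, (speaker, text) in enumerate(turns):
        if rng.random() < stutter_prob:
            text = stutter(text)
        if rng.random() < filler_prob:
            text = f"{rng.choice(FILLERS).capitalize()}, {text}"
        # A response word only makes sense when replying to a previous turn.
        if i > 0 and rng.random() < response_prob:
            text = f"{rng.choice(RESPONSES)} {text}"
        out.append((speaker, text))
    return out
```

Passing a seeded random.Random makes the output reproducible, which helps when comparing a plain script against its "spontaneous" version.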

MoonCast Project Links

Official Website: https://mooncastdemo.github.io/
GitHub Repository: https://github.com/jzq2000/MoonCast
arXiv Paper: https://arxiv.org/pdf/2503.14345
Online Demo: https://huggingface.co/spaces/jzq11111/mooncast

Application Scenarios for MoonCast

Content Creation
Transforms various forms of text—stories, technical reports, news articles—into engaging podcast audio content.
Education
Converts academic materials such as papers and ebooks into podcasts, helping students understand and absorb content more effectively.
Entertainment Industry
Generates podcast-style audio with natural conversational flow, suitable for entertainment content production.
Business Use
Creates internal training podcasts or transforms press releases, product descriptions, and marketing materials into audio formats for external communication.
Personal Use
Allows individuals to convert personal content like blogs or journals into podcasts, enabling convenient listening during activities like commuting or exercising.