MoE-TTS – Kunlun Wanwei’s Speech Synthesis Framework


What is MoE-TTS?

MoE-TTS, launched by Kunlun Wanwei’s speech team, is the first character-description speech synthesis framework built on a Mixture-of-Experts (MoE) architecture, designed to enhance understanding of open-domain text descriptions. The model combines a pre-trained large language model (LLM) with specialized speech expert modules through the MoE architecture. During training, only the speech module parameters are updated while the text module parameters remain frozen, preserving the LLM’s strong text comprehension while improving the accuracy of speech generation. Experiments show that MoE-TTS significantly outperforms existing commercial models at generating speech that closely matches a given description, particularly for complex, open-domain content.
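
As a rough illustration of this training split, the following PyTorch sketch freezes a stand-in text expert and updates only a speech expert. The module names here are hypothetical placeholders, not the actual MoE-TTS code:

```python
import torch
import torch.nn as nn

# Stand-ins for the pre-trained text expert and the speech expert;
# hypothetical placeholders, not the actual MoE-TTS modules.
text_expert = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
speech_expert = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)

# Freeze the text expert so the LLM's pre-trained knowledge is preserved.
for param in text_expert.parameters():
    param.requires_grad = False

# Only the speech expert's parameters receive gradient updates.
optimizer = torch.optim.AdamW(speech_expert.parameters(), lr=1e-4)
```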

Key Features of MoE-TTS

  • Enhanced Open-Domain Text Understanding: Accurately interprets and generates speech aligned with complex, open-domain text descriptions, including content not seen during training.

  • Natural Language Description Driven: Users can control speech style and characteristics through natural language prompts (e.g., “energetic young male voice” or “actor with a New York accent”); see the usage sketch after this list.

  • High-Quality Speech Generation: Produces speech with superior naturalness, emotional expression, and style consistency, outperforming traditional TTS models.

  • Cross-Modal Knowledge Transfer: Transfers the LLM’s text comprehension capabilities to speech generation, enhancing understanding and expression of complex semantics.
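
To make the prompt-driven control concrete, here is a minimal sketch of what a description-driven call could look like. The `synthesize` function and its parameters are hypothetical, since the article does not document the actual MoE-TTS interface:

```python
from typing import Optional

def synthesize(text: str, description: str) -> Optional[bytes]:
    """Hypothetical description-driven TTS call: returns waveform bytes
    for `text`, with voice style controlled by the natural-language
    `description`. This stub only illustrates the interface shape."""
    return None  # a real system would return synthesized audio here

audio = synthesize(
    text="Welcome back! How can I help you today?",
    description="energetic young male voice with a New York accent",
)
```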


Technical Principles of MoE-TTS

  • Pre-Trained LLM as Base Model: Uses a pre-trained text LLM as the foundation, with frozen parameters to retain strong text understanding.

  • Modal Routing Strategy: Assigns text tokens to text experts and speech tokens to speech experts via a modal routing mechanism, avoiding cross-modal interference (a minimal sketch of this routing follows the list).

  • Frozen Text Expert Module: Only speech expert module parameters are updated during training; text expert module parameters remain frozen to preserve pre-trained knowledge.

  • Modal-Aware Transformer Components: Converts core Transformer components (layer normalization, feed-forward networks, multi-head attention) into modal-aware MoE layers, improving processing of different modalities.

  • Speech Generation Module: Combines diffusion models (e.g., Elucidated Diffusion Models) and VAEGAN components to convert discrete speech tokens into high-quality continuous waveforms.
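
The following PyTorch sketch illustrates the modal-routing idea behind these modal-aware layers, using a feed-forward block as the example. Class names, sizes, and the mask convention are assumptions for illustration, not the published MoE-TTS implementation:

```python
import torch
import torch.nn as nn

class ModalAwareFFN(nn.Module):
    """Modal-aware MoE feed-forward layer (illustrative sketch).

    Text tokens are routed to a frozen text expert and speech tokens to
    a trainable speech expert; names and sizes are assumptions, not the
    published MoE-TTS implementation.
    """

    def __init__(self, d_model: int = 1024, d_ff: int = 4096):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.speech_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Freeze the text expert to preserve the pre-trained LLM weights.
        for param in self.text_ffn.parameters():
            param.requires_grad = False

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_speech: (batch, seq) boolean mask.
        # Modal routing is deterministic: each token is sent to exactly
        # one expert based on its modality, so the two modalities never
        # mix inside an expert.
        out = torch.empty_like(x)
        out[~is_speech] = self.text_ffn(x[~is_speech])
        out[is_speech] = self.speech_ffn(x[is_speech])
        return out

# Example: a batch where the first half of each sequence is text tokens
# and the second half is speech tokens.
layer = ModalAwareFFN()
hidden = torch.randn(2, 8, 1024)
mask = torch.tensor([[False] * 4 + [True] * 4] * 2)
output = layer(hidden, mask)  # shape: (2, 8, 1024)
```

Per the description above, MoE-TTS applies the same modality-based split to layer normalization and multi-head attention as well; routing by token modality, rather than by a learned gate, is what keeps text and speech computations from interfering.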


Project Link


Application Scenarios

  • Virtual Assistants & Smart Customer Service: Enables natural and human-like speech responses, greatly improving user experience.

  • Audio Content Creation: Generates high-quality voice for audiobooks, podcasts, and other audio media, with varied styles and rich emotions.

  • Digital Humans & Virtual Character Voiceover: Creates personalized voices based on character settings, bringing digital humans and virtual characters to life.

  • Education & Training: Supports multilingual and multi-style speech generation, making learning content more diverse, engaging, and effective.

  • Gaming & Interactive Entertainment: Produces real-time speech matching game scenarios, making character dialogues vivid and immersive.
