Higgs Audio V2 – An Open-Source Speech Model Capable of Simulating Multi-Speaker Interaction Scenarios


What is Higgs Audio V2?

Higgs Audio V2 is an open-source large-scale speech model developed by Mu Li and the Boson AI team. Trained on over 10 million hours of audio data, it supports multilingual dialogue generation, automatic prosody adjustment, voice cloning, and singing synthesis. The model is capable of simulating natural, fluent multi-speaker conversations, automatically matching the speaker’s emotions and intonation, and supports low-latency real-time voice interaction.

Higgs Audio V2 enables zero-shot voice cloning: users can replicate the vocal characteristics of a specific individual from only a short voice sample. It can also synthesize singing voices and generate speech together with background music, making it a powerful tool for audio content creation.
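To make the cloning workflow concrete, the sketch below packages the three inputs a zero-shot cloning request typically bundles: a short reference clip, its transcript, and the new text to speak. The field names and helper function are illustrative assumptions, not the actual Higgs Audio V2 API.

```python
# Illustrative sketch only: field names ("reference", "target_text") and
# this helper are assumptions, not the documented Higgs Audio V2 interface.

def build_cloning_request(ref_audio_path, ref_transcript, target_text):
    """Package an in-context voice-cloning prompt: reference clip plus
    its transcript anchor the voice; target_text is what to synthesize."""
    return {
        "reference": {"audio": ref_audio_path, "text": ref_transcript},
        "target_text": target_text,
    }

req = build_cloning_request(
    "speaker_sample.wav",
    "This is a short sample of my voice.",
    "Now say this new sentence in the same voice.",
)
print(req["target_text"])
```

The key point is that no fine-tuning step appears anywhere: the reference audio and its transcript act purely as an in-context prompt.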



Key Features of Higgs Audio V2

  • Multilingual Dialogue Generation:
    Supports generating dialogues in multiple languages, simulating natural multi-speaker interaction while automatically aligning with each speaker’s emotional tone and energy level.

  • Automatic Prosody Adjustment:
    In long text-to-speech tasks, the model automatically adjusts speech speed, pauses, and intonation based on the content, producing fluent and natural-sounding speech without manual intervention.

  • Voice Cloning & Singing Synthesis:
    With only a brief voice sample, the model can perform zero-shot voice cloning, reproducing a specific speaker’s vocal characteristics and even enabling the cloned voice to sing melodies.

  • Real-Time Voice Interaction:
    Offers low-latency responses, understands user emotions, and responds with emotional expression for a near-human interaction experience.

  • Speech & Background Music Co-Generation:
    Can simultaneously generate speech and background music, enabling creative workflows like “write a song and sing it” in a seamless pipeline.
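Multi-speaker generation starts from a transcript that marks who says what. The sketch below formats dialogue turns with `[SPEAKER0]`/`[SPEAKER1]` tags, a convention modeled on common open-source speech-model prompt formats; it is an assumption here, not a documented Higgs Audio V2 requirement.

```python
# Minimal sketch of preparing a tagged multi-speaker transcript.
# The [SPEAKERn] tag convention is an assumption, not a confirmed
# Higgs Audio V2 prompt format.

def format_dialogue(turns):
    """Render (speaker_id, text) turns as a tagged transcript string."""
    return "\n".join(f"[SPEAKER{spk}] {text}" for spk, text in turns)

turns = [
    (0, "Did you hear the new model can sing?"),
    (1, "Yes, and it clones a voice from a short sample."),
    (0, "Let's try generating a duet."),
]
transcript = format_dialogue(turns)
print(transcript)
```

A transcript like this gives the model explicit turn boundaries, so it can assign each line a distinct voice and match emotional tone per speaker.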


Technical Highlights of Higgs Audio V2

  • AudioVerse Dataset:
    A custom-built dataset consisting of 10 million hours of audio, annotated via an automated pipeline that combines multiple speech recognition models, sound event classification models, and proprietary audio understanding models.

  • Unified Audio Tokenizer:
    A tokenizer trained from scratch to capture both semantic and acoustic features, allowing for more accurate and expressive audio generation.

  • DualFFN Architecture:
Significantly improves the model’s handling of acoustic tokens while adding little extra computation, raising speech synthesis quality.

  • Zero-Shot Voice Cloning:
    Through in-context learning, the model can perform voice cloning using short reference audio prompts, matching the speaker’s vocal style with high fidelity.
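The DualFFN idea described above can be illustrated with a toy layer: text and audio tokens share the rest of the transformer block, but each modality is routed to its own feed-forward network. The shapes, the mask-based routing, and the plain NumPy MLPs below are conceptual assumptions for illustration, not the actual Higgs Audio V2 implementation.

```python
import numpy as np

# Toy DualFFN-style layer: tokens are routed by modality to separate
# feed-forward networks. All shapes and weights here are illustrative.

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16

def make_ffn():
    """Build a small two-layer ReLU MLP with random weights."""
    w1 = rng.standard_normal((d_model, d_ff)) * 0.1
    w2 = rng.standard_normal((d_ff, d_model)) * 0.1
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

text_ffn, audio_ffn = make_ffn(), make_ffn()

def dual_ffn(hidden, is_audio):
    """Apply the text FFN to text tokens and the audio FFN to audio
    tokens, then add a residual connection."""
    out = np.empty_like(hidden)
    out[~is_audio] = text_ffn(hidden[~is_audio])
    out[is_audio] = audio_ffn(hidden[is_audio])
    return hidden + out

tokens = rng.standard_normal((5, d_model))
mask = np.array([False, False, True, True, False])  # which tokens are audio
print(dual_ffn(tokens, mask).shape)  # (5, 8)
```

The per-modality FFN adds parameters but not sequence-length-dependent compute, which is consistent with the article's claim that DualFFN improves acoustic-token handling at little extra cost.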



Application Scenarios for Higgs Audio V2

  • Real-Time Voice Interaction:
    Ideal for virtual streamers, real-time voice assistants, and other use cases requiring emotional, low-latency interactions.

  • Audio Content Creation:
    Useful in generating natural dialogues and narrations for audiobooks, interactive training modules, and dynamic storytelling.

  • Entertainment & Creative Industries:
    Voice cloning allows for replicating specific characters’ voices, opening new creative possibilities in media and entertainment.
