What is AudioGen-Omni?
AudioGen-Omni is a multimodal audio generation framework developed by Kuaishou. It can generate high-quality audio, speech, and music based on inputs such as video, text, or their combination. The framework features a unified lyrics-text encoder and a novel Phase-Aligned Anisotropic Positional Injection (PAAPI) technique, enabling precise audiovisual alignment and cross-modal synchronization. AudioGen-Omni supports multilingual input, delivers fast inference (generating 8 seconds of audio in just 1.91 seconds), and performs exceptionally well across various audio generation tasks. It is ideal for use cases such as video dubbing, speech synthesis, and song creation.
Key Features of AudioGen-Omni
- Multimodal Audio Generation: Generates high-quality audio, speech, or songs from video, text, or their combination.
- Precise Audiovisual Alignment: Achieves lip-sync and rhythm alignment using PAAPI for accurate cross-modal synchronization.
- Multilingual Support: Accepts input in multiple languages and generates the corresponding speech or songs.
- Fast Inference: Highly efficient, producing 8 seconds of audio in just 1.91 seconds and significantly outperforming comparable models.
- Flexible Input Handling: Generates stable outputs even when one modality (text or video) is missing.
- High-Fidelity Audio: Produces audio that matches the input's semantics and acoustics with high precision and clarity.
Technical Highlights of AudioGen-Omni
- Multimodal Diffusion Transformer (MMDiT): Integrates video, audio, and text into a shared semantic space, enabling versatile audio generation. It uses a joint training paradigm on large-scale video-text-audio data to strengthen cross-modal correlation.
- Unified Lyrics-Text Encoder: Encodes both graphemes and phonemes into dense frame-level representations suitable for both speech and singing tasks, using multilingual tokenization and ConvNeXt refinement to produce frame-aligned outputs.
- Phase-Aligned Anisotropic Positional Injection (PAAPI): Selectively applies Rotary Position Embedding (RoPE) to temporal modalities (e.g., video and audio), improving alignment accuracy across modalities.
- Dynamic Conditioning Mechanism: Avoids the limitations of frozen-text paradigms by unfreezing all modalities and masking missing inputs, allowing more flexible multimodal conditioning.
- Joint Attention Mechanism: Fuses cross-modal features through joint attention, with Adaptive Layer Normalization (AdaLN) modulating each stream for efficient information exchange across modalities.
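The core idea behind PAAPI, as described above, is to apply rotary positional encoding only to modalities that have a genuine time axis. The project does not publish reference code here, so the following is a minimal sketch under that assumption; the `rope` and `paapi` names and the exact rotation layout are illustrative, not the paper's implementation:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply Rotary Position Embedding to x of shape (seq, dim).

    Each feature pair (x1_i, x2_i) is rotated by an angle proportional
    to the token's position, so relative offsets are encoded in phase.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def paapi(tokens, is_temporal, frame_positions):
    """Anisotropic injection: rotate only temporal tokens (video/audio
    frames); non-temporal tokens (e.g., a global text prompt) pass through
    unrotated. `frame_positions` carries the shared frame-time axis that
    keeps the two temporal streams phase-aligned."""
    out = tokens.copy()
    out[is_temporal] = rope(tokens[is_temporal], frame_positions)
    return out
```

Because RoPE is a pure rotation, it preserves token norms while making attention scores depend on relative frame offsets, which is what makes it suitable for lip-sync-style alignment.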
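The dynamic conditioning mechanism "masks missing inputs" so the model still receives a fixed-shape conditioning sequence when text or video is absent. One common realisation of that idea, sketched below with hypothetical names (`MASK_EMB` stands in for a learned missing-modality embedding; the real model's layout is not specified here):

```python
import numpy as np

DIM = 16
# Stand-in for a learned <missing-modality> embedding; zeros for illustration.
MASK_EMB = np.zeros(DIM)

def build_condition(text_tokens=None, video_tokens=None, n_text=4, n_video=8):
    """Concatenate text and video condition tokens, substituting the mask
    embedding for any absent modality so the output shape is constant."""
    text = text_tokens if text_tokens is not None else np.tile(MASK_EMB, (n_text, 1))
    video = video_tokens if video_tokens is not None else np.tile(MASK_EMB, (n_video, 1))
    return np.concatenate([text, video], axis=0)
```

Keeping the conditioning tensor's shape fixed is what lets a single network handle text-only, video-only, and joint inputs without separate code paths.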
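AdaLN, mentioned in the joint attention bullet, normalises a token stream and then modulates it with a scale and shift regressed from a conditioning vector (typically a timestep or modality summary). A minimal sketch of that mechanism follows; the weight shapes and the `(1 + scale)` parameterisation are common DiT-style conventions, assumed rather than taken from the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token (last axis) to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, cond, w_scale, w_shift):
    """Adaptive LayerNorm: scale and shift the normalised tokens using
    parameters predicted from the conditioning vector `cond`."""
    scale = cond @ w_scale   # (dim,)
    shift = cond @ w_shift   # (dim,)
    return layer_norm(x) * (1 + scale) + shift
```

With a zero conditioning vector this reduces to plain LayerNorm, which is why the `(1 + scale)` form is a convenient identity-at-init choice.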
Project Links
- Official Website: https://ciyou2.github.io/AudioGen-Omni/
- arXiv Technical Paper: https://ciyou2.github.io/AudioGen-Omni/
Application Scenarios of AudioGen-Omni
- Video Dubbing: Automatically generates voices, songs, or sound effects precisely matched to a video, improving creation efficiency and richness.
- Speech Synthesis: Quickly converts text into natural, fluent speech, ideal for audiobooks, voice assistants, and intelligent customer service.
- Song Creation: Generates matching songs from video content or lyrics, supporting music production and enriching video background music.
- Sound Effect Generation: Creates realistic environmental or action sound effects from text descriptions or video content, enhancing immersion.