MultiTalk – An Audio-Driven Framework for Generating Multi-Person Conversation Videos
What is MultiTalk?
MultiTalk is a novel audio-driven multi-person conversational video generation framework jointly developed by Sun Yat-sen University (Shenzhen campus), Meituan, and The Hong Kong University of Science and Technology. It generates videos featuring interactive characters with lip movements synchronized to multi-channel audio inputs, guided by reference images and text prompts.
The framework introduces a new method, Label Rotary Position Embedding (L-RoPE), to address the challenge of binding multi-channel audio to the correct characters. Through partial parameter tuning and multi-task training, MultiTalk retains the base model's strong instruction-following ability. It achieves state-of-the-art video generation performance on multiple datasets and applies to diverse scenarios such as animated conversations, singing avatars, and instruction-based video creation.
Key Features of MultiTalk
- Audio-Driven Multi-Speaker Video Generation: Generates videos of multiple interacting characters from multi-channel audio, reference images, and text prompts, with accurate lip synchronization.
- Audio-to-Character Binding: Uses Label Rotary Position Embedding (L-RoPE) to bind each audio channel to its corresponding character, avoiding mismatches.
- Instruction-Following Capability: Preserves the base model's ability to follow text instructions through partial parameter tuning and a multi-task training strategy.
Technical Foundations of MultiTalk
- Audio-Driven Video Generation Framework: Built on a Diffusion Transformer (DiT)-based video diffusion model and a 3D Variational Autoencoder (VAE), the system compresses and reconstructs video efficiently across spatial and temporal dimensions (a shape-only sketch follows this list).
- Audio Feature Extraction: Leverages Wav2Vec to extract audio features and compresses them temporally to match the video frame rate. Audio cross-attention layers are added to each DiT block so the audio can guide video generation (see the audio cross-attention sketch below).
- Label Rotary Position Embedding (L-RoPE): Assigns a label range to each character and to the background, and embeds these labels into both audio and visual features via rotary position embeddings to ensure accurate audio-character alignment (see the L-RoPE sketch below).
- Adaptive Character Localization: Tracks character positions dynamically using attention maps between the reference images and the generated video, enabling precise audio-to-character mapping (see the localization sketch below).
- Training Strategy:
  - Stage 1: Focuses on single-character animation.
  - Stage 2: Handles multi-character scenarios using partial parameter tuning (only the audio cross-attention and adapter layers are updated) to preserve the base model's instruction-following ability (see the partial-tuning sketch below).
- Multi-Task Learning: Combines Audio + Image to Video (AI2V) and Image to Video (I2V) tasks across various datasets to enhance the model's generalization and performance.
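The sketches below flesh out the items above; all of them are minimal PyTorch illustrations built on stated assumptions, not MultiTalk's actual implementation. First, the spatio-temporal compression performed by a 3D VAE encoder, shown purely at the level of tensor shapes; the layer sizes and the 4x temporal / 8x spatial compression factors are illustrative assumptions.

```python
# Shape-only sketch of 3D (spatio-temporal) video compression with a toy 3D-conv encoder.
# Channel counts and compression factors are assumptions, not the actual VAE configuration.
import torch
import torch.nn as nn

class Toy3DEncoder(nn.Module):
    def __init__(self, in_channels=3, latent_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(1, 2, 2), padding=1),  # downsample space
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),          # downsample space + time
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video):                       # video: [B, C, T, H, W]
        return self.net(video)

video = torch.randn(1, 3, 16, 256, 256)             # 16 RGB frames at 256x256
latent = Toy3DEncoder()(video)
print(latent.shape)                                  # torch.Size([1, 16, 4, 32, 32]) -> 4x time, 8x space
```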
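Next, a minimal sketch of the audio path: Wav2Vec features are resampled to the video frame rate and then injected into a DiT-style block through cross-attention, with video tokens as queries. The checkpoint name, feature dimensions, and the use of `nn.MultiheadAttention` are assumptions for illustration only.

```python
# Sketch of audio conditioning: Wav2Vec features resampled to the video frame rate,
# then injected into a DiT-style block through cross-attention (video tokens as queries).
# Checkpoint, dimensions, and layer choices are assumptions, not MultiTalk's code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform = torch.randn(16000 * 4)                               # 4 s of 16 kHz audio (placeholder)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feats = wav2vec(inputs.input_values).last_hidden_state  # [1, ~200 tokens (~50 Hz), 768]

num_frames = 100                                                 # 4 s of video at 25 fps
# Temporal compression/resampling so each video frame gets one audio token.
audio_feats = F.interpolate(
    audio_feats.transpose(1, 2), size=num_frames, mode="linear", align_corners=False
).transpose(1, 2)                                                # [1, num_frames, 768]

class AudioCrossAttention(nn.Module):
    """Audio cross-attention as it could be inserted into each DiT block."""
    def __init__(self, dim=1024, audio_dim=768, heads=8):
        super().__init__()
        self.proj = nn.Linear(audio_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):               # [B, N, dim], [B, T, audio_dim]
        audio = self.proj(audio_tokens)
        out, _ = self.attn(video_tokens, audio, audio)           # video queries attend to audio
        return video_tokens + out                                # residual injection

video_tokens = torch.randn(1, 2048, 1024)                        # flattened video latent tokens
print(AudioCrossAttention()(video_tokens, audio_feats).shape)    # torch.Size([1, 2048, 1024])
```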
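A minimal sketch of the L-RoPE idea: each character (and the background) gets a label, and the same rotary rotation is applied to that character's video tokens and to the audio stream meant to drive it, so attention becomes label-aware. The concrete label values and the `rope_rotate` helper are assumptions; the paper defines its own label ranges.

```python
# Sketch of Label Rotary Position Embedding (L-RoPE): rotary embeddings applied over
# character labels instead of (or in addition to) spatial positions.
# Label values, dimensions, and the helper below are illustrative assumptions.
import torch

def rope_rotate(x, positions, theta=10000.0):
    """Rotate the feature vector x by rotary angles derived from `positions` (here: labels)."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (theta ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions[..., None] * freqs                        # [..., half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

dim, n_video, n_audio = 64, 6, 4
video_tokens = torch.randn(n_video, dim)
# Video tokens of person 0, person 1, and the background carry distinct labels.
video_labels = torch.tensor([0., 0., 20., 20., 40., 40.])        # assumed label values

audio_tokens = torch.randn(2, n_audio, dim)                      # two audio streams
audio_labels = torch.tensor([0., 20.])                           # stream 0 -> person 0, stream 1 -> person 1

q = rope_rotate(video_tokens, video_labels)                                # label-rotated video queries
k = rope_rotate(audio_tokens, audio_labels[:, None].expand(2, n_audio))   # label-rotated audio keys

# After rotation, each query-key dot product depends on the difference between the video
# token's label and the audio stream's label, which the model can learn to exploit so that
# each audio stream binds to the character carrying the matching label.
scores = torch.einsum("nd,smd->nsm", q, k) / dim ** 0.5
print(scores.shape)                                              # torch.Size([6, 2, 4])
```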
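A minimal sketch of adaptive character localization under the same assumptions: the attention between reference-image tokens and the tokens of a generated frame decides which spatial tokens belong to which character, so per-character labels can be assigned frame by frame. The plain argmax assignment is a simplification.

```python
# Sketch of adaptive character localization via attention maps between reference images
# and a generated frame. Shapes and the argmax assignment are illustrative assumptions.
import torch

num_persons, ref_len, vid_len, dim = 2, 16, 1024, 128

ref_feats = torch.randn(num_persons, ref_len, dim)       # tokens from each reference image
vid_feats = torch.randn(vid_len, dim)                    # tokens of one generated frame

# Attention of each video token toward all reference tokens, averaged per person.
attn = torch.softmax(vid_feats @ ref_feats.reshape(-1, dim).T / dim ** 0.5, dim=-1)
attn = attn.reshape(vid_len, num_persons, ref_len).mean(dim=-1)   # [vid_len, num_persons]

# Assign each video token to the person it attends to most strongly; tokens with uniformly
# low attention could instead be treated as background and receive the background label.
person_of_token = attn.argmax(dim=-1)                    # [vid_len], values in {0, 1}
print(person_of_token.shape)
```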
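Finally, a sketch of the stage-2 partial parameter tuning and multi-task mixing: the base model stays frozen and only the audio cross-attention and adapter layers receive gradients, while training batches alternate between AI2V and I2V. The module name patterns and the sampling ratio are assumptions, not the repository's configuration.

```python
# Sketch of partial parameter tuning plus AI2V/I2V multi-task mixing.
# The name patterns "audio_cross_attn"/"audio_adapter" and the 70/30 task split are assumptions.
import random
import torch

def configure_partial_tuning(model: torch.nn.Module):
    """Freeze the base model; leave only the audio-conditioning modules trainable."""
    trainable = []
    for name, param in model.named_parameters():
        if "audio_cross_attn" in name or "audio_adapter" in name:
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False      # keep the base model's instruction-following intact
    return trainable

def sample_task():
    """Multi-task mixing: mostly audio+image-to-video, some image-to-video batches."""
    return "AI2V" if random.random() < 0.7 else "I2V"

# Usage, assuming `dit_model` is an nn.Module whose audio layers follow the naming above:
# params = configure_partial_tuning(dit_model)
# optimizer = torch.optim.AdamW(params, lr=1e-5)
# task = sample_task()    # choose the conditioning/loss path for this batch
```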
Project Resources
- Project Website: https://meigen-ai.github.io/multi-talk/
- GitHub Repository: https://github.com/MeiGen-AI/MultiTalk
- HuggingFace Model Hub: https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk
- arXiv Paper: https://arxiv.org/pdf/2505.22647
Application Scenarios
- Film and Entertainment: Used in animated films, VFX, and game cinematics to generate interactive dialogue scenes efficiently, enhancing visual quality and immersion.
- Education and Training: Applied in online education, virtual classrooms, and language learning to create interactive instructional videos that simulate real conversations.
- Advertising and Marketing: Generates product demos, virtual assistant videos, and promotional content that boost user engagement and enhance customer experience.
- Social Media and Content Creation: Enables creators to produce engaging, multi-character conversational videos and virtual livestreams that increase content interactivity and virality.
- Intelligent Services: Supports natural and fluent video-based interactions in virtual assistants and customer service bots, delivering more personalized and satisfying user experiences.