MultiTalk – An Audio-Driven Framework for Generating Multi-Person Conversation Videos


What is MultiTalk?

MultiTalk is a novel audio-driven multi-person conversational video generation framework jointly developed by Sun Yat-sen University (Shenzhen campus), Meituan, and The Hong Kong University of Science and Technology. It generates videos featuring interactive characters whose lip movements are synchronized to multi-channel audio inputs, guided by reference images and text prompts.

The framework introduces a new method called Label Rotary Position Embedding (L-RoPE) to address the challenge of binding multi-channel audio to the correct characters. Through partial parameter tuning and multi-task training, MultiTalk retains the strong instruction-following abilities of its base model. It achieves state-of-the-art video generation performance across multiple datasets and applies to diverse scenarios such as animated conversations, singing avatars, and instruction-based video creation.



Key Features of MultiTalk

  • Audio-Driven Multi-Speaker Video Generation: Generates videos with multiple interacting characters from multi-channel audio, reference images, and text prompts, with accurate lip synchronization.

  • Audio-to-Character Binding Solution: Uses Label Rotary Position Embedding (L-RoPE) to correctly bind each audio channel to its corresponding character, avoiding mismatches.

  • Instruction-Following Capability: Maintains the model’s ability to follow text instructions using partial parameter tuning and a multi-task training strategy.


Technical Foundations of MultiTalk

  • Audio-Driven Video Generation Framework: Built on a Diffusion Transformer (DiT)-based video diffusion model with a 3D Variational Autoencoder (VAE) that efficiently compresses and reconstructs video across spatial and temporal dimensions.

  • Audio Feature Extraction: Leverages Wav2Vec to extract audio features and compresses them temporally to match the video frame rate. Audio cross-attention layers are added to each DiT block so the audio can guide video generation (a minimal sketch follows this list).

  • Label Rotary Position Embedding (L-RoPE): Assigns a label range to each character and to the background, and embeds these labels into both the audio and visual features via rotary position embeddings to ensure accurate audio-character alignment (see the second sketch after this list).

  • Adaptive Character Localization: Tracks character positions dynamically using attention maps from reference images and the generated video, enabling precise audio-to-character mapping.

  • Training Strategy:

    • Stage 1: Focuses on single-character animation.

    • Stage 2: Handles multi-character scenarios using partial parameter tuning (only the audio cross-attention and adapter layers are updated) to preserve the instruction-following abilities of the base model (see the third sketch after this list).

  • Multi-Task Learning: Combines Audio + Image to Video (AI2V) and Image to Video (I2V) tasks across various datasets to enhance the model’s generalization and performance.
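
The sketch below illustrates, in simplified PyTorch, how per-frame audio conditioning of this kind can be wired up: Wav2Vec-style features are pooled to the video frame rate and injected into a DiT block through an added audio cross-attention layer. The layer names, dimensions, and pooling scheme are illustrative assumptions, not the released MultiTalk code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Pools Wav2Vec-style features (~50 per second) to one token per video frame."""
    def __init__(self, audio_dim=768, hidden_dim=1024):
        super().__init__()
        self.proj = nn.Linear(audio_dim, hidden_dim)

    def forward(self, wav2vec_feats, num_video_frames):
        # wav2vec_feats: (B, T_audio, audio_dim)
        x = wav2vec_feats.transpose(1, 2)                       # (B, audio_dim, T_audio)
        x = F.interpolate(x, size=num_video_frames, mode="linear", align_corners=False)
        x = x.transpose(1, 2)                                   # (B, T_video, audio_dim)
        return self.proj(x)                                     # (B, T_video, hidden_dim)

class AudioCrossAttention(nn.Module):
    """Added to each DiT block: video latent tokens attend to per-frame audio tokens."""
    def __init__(self, hidden_dim=1024, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, N_latent, hidden_dim), audio_tokens: (B, T_video, hidden_dim)
        out, _ = self.attn(self.norm(video_tokens), audio_tokens, audio_tokens)
        return video_tokens + out                               # residual injection

# Toy usage: 2 s of Wav2Vec features pooled to 50 video frames (25 fps).
enc, xattn = AudioEncoder(), AudioCrossAttention()
audio = enc(torch.randn(1, 100, 768), num_video_frames=50)
video = xattn(torch.randn(1, 1024, 1024), audio)
```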

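The next sketch outlines the L-RoPE idea under the same caveat: each video token carries a label from the character region it belongs to (obtained from the attention-based localization described above), each audio token carries the label of its stream, and a rotary embedding over those labels is applied before the audio cross-attention so that matching labels align in the attention dot-product. The specific label values and dimensions below are illustrative, not the ranges used in the paper.

```python
import torch

def label_rotary_embed(x, labels, base=10000.0):
    """Rotate token features by their class label rather than their position.
    x: (B, N, D) tokens; labels: (B, N) float labels."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device, dtype=x.dtype) / half))
    angles = labels.unsqueeze(-1) * freqs                 # (B, N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Illustrative label assignment (not the paper's exact ranges):
#   video tokens in character 1's region -> labels around 2
#   video tokens in character 2's region -> labels around 22
#   background tokens                    -> a fixed neutral label
#   audio stream 1 -> label 2, audio stream 2 -> label 22
B, N_video, T_audio, D = 1, 16, 8, 64
video_tokens, audio_tokens = torch.randn(B, N_video, D), torch.randn(B, T_audio, D)
video_labels = torch.full((B, N_video), 2.0)   # region of character 1 (from localization)
audio_labels = torch.full((B, T_audio), 2.0)   # audio stream 1 carries the matching label
q = label_rotary_embed(video_tokens, video_labels)  # rotated queries
k = label_rotary_embed(audio_tokens, audio_labels)  # rotated keys: matching labels align
```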

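Finally, a minimal sketch of the stage-2 partial parameter tuning: the pretrained video DiT is frozen and only the newly added audio cross-attention and adapter parameters remain trainable, which is what preserves the base model's instruction-following behaviour. The parameter-name substrings used for filtering are hypothetical.

```python
import torch

def stage2_trainable_params(model):
    """Freeze the pretrained DiT; leave only the added audio layers trainable.
    'audio_cross_attn' / 'audio_adapter' are hypothetical parameter names."""
    params = []
    for name, param in model.named_parameters():
        train_it = ("audio_cross_attn" in name) or ("audio_adapter" in name)
        param.requires_grad = train_it
        if train_it:
            params.append(param)
    return params

# optimizer = torch.optim.AdamW(stage2_trainable_params(dit_model), lr=1e-5)
```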


Application Scenarios

  • Film and Entertainment: Used in animated films, VFX, and game cinematics to generate interactive dialogue scenes efficiently, enhancing visual quality and immersion.

  • Education and Training: Applied in online education, virtual classrooms, and language learning to create interactive instructional videos that simulate real conversations.

  • Advertising and Marketing: Generates product demos, virtual assistant videos, and promotional content that boost user engagement and enhance customer experience.

  • Social Media and Content Creation: Enables creators to produce engaging, multi-character conversational videos and virtual livestreams that increase content interactivity and virality.

  • Intelligent Services: Supports natural and fluent video-based interactions in virtual assistants and customer service bots, delivering more personalized and satisfying user experiences.
