OmniAvatar – An audio-driven full-body video generation model jointly developed by Zhejiang University and Alibaba
What is OmniAvatar?
OmniAvatar is an audio-driven full-body video generation model jointly developed by Zhejiang University and Alibaba Group. Given an audio input and optional text prompts, it generates natural, realistic full-body animated videos in which the character’s expressions and motions stay synchronized with the speech. Powered by pixel-level multi-stage audio embedding and LoRA fine-tuning techniques, OmniAvatar significantly improves lip-sync accuracy and the naturalness of body movements. It also supports interactions between the character and objects, background control, and emotion manipulation, making it suitable for a wide range of applications including podcasts, interactive videos, and virtual environments.
Key Features of OmniAvatar
- Natural Lip Synchronization: Generates lip movements that are precisely synchronized with the input audio, maintaining high accuracy even in complex scenarios.
- Full-Body Animation Generation: Produces smooth, realistic body motions, resulting in lifelike and engaging animated characters.
- Text-Based Control: Supports precise control over video content via text prompts, allowing customization of character movements, background settings, emotional states, and more.
- Human-Object Interaction: Capable of generating scenes where the character interacts with surrounding objects, such as picking up items or operating devices, expanding its practical use cases.
- Background Control: Allows dynamic background changes based on textual instructions to fit various scene requirements.
- Emotion Control: Characters can express different emotions such as happiness, sadness, or anger, guided by text prompts, enhancing expressiveness and realism.
Technical Foundations of OmniAvatar
- Pixel-Level Multi-Stage Audio Embedding: Maps audio features into the model’s latent space at the pixel level, so the audio signal directly influences body-movement generation, improving both lip-sync precision and full-body motion naturalness (see the first sketch after this list).
- LoRA (Low-Rank Adaptation) Fine-Tuning: Applies LoRA to fine-tune pre-trained models efficiently. It introduces low-rank decompositions into the weight matrices to reduce the number of trainable parameters while preserving the model’s original capabilities, improving both training efficiency and output quality (second sketch below).
- Long-Form Video Generation Strategy: To support long videos, OmniAvatar uses reference-image embeddings to maintain character identity and frame-overlap strategies to ensure temporal continuity and avoid abrupt motion transitions (third sketch below).
- Diffusion-Based Video Generation: Uses diffusion models as the core framework, gradually denoising random latents into high-quality video frames; these models excel at producing long, coherent video sequences (fourth sketch below).
- Transformer Architecture Integration: Incorporates Transformers within the diffusion framework to better capture long-range dependencies and maintain semantic coherence across frames, further enhancing the quality and consistency of the generated videos.
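To make the pixel-level audio-embedding idea concrete, here is a minimal PyTorch-style sketch in which per-frame audio features (e.g., from a speech encoder) are projected and broadcast over every spatial position of the video latent at several network stages. All class and parameter names are hypothetical illustrations, not OmniAvatar’s actual code.

```python
import torch
import torch.nn as nn

class MultiStageAudioEmbedding(nn.Module):
    """Hypothetical sketch: inject per-frame audio features into a video
    latent at several stages so that audio conditions every spatial
    position ("pixel level"). Not the official OmniAvatar implementation."""

    def __init__(self, audio_dim=768, latent_channels=(320, 640, 1280)):
        super().__init__()
        # One projection per stage of the denoising network.
        self.projections = nn.ModuleList(
            nn.Linear(audio_dim, c) for c in latent_channels
        )

    def forward(self, latent, audio_feats, stage):
        # latent:      (B, C, T, H, W) video latent at this stage
        # audio_feats: (B, T, audio_dim), one feature vector per latent frame
        b, c, t, h, w = latent.shape
        emb = self.projections[stage](audio_feats)   # (B, T, C)
        emb = emb.permute(0, 2, 1)[..., None, None]  # (B, C, T, 1, 1)
        # Broadcast the audio embedding over all spatial positions
        # and add it to the latent, so audio influences every pixel.
        return latent + emb.expand(b, c, t, h, w)
```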
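LoRA itself is a general technique, so the following sketch shows the standard low-rank update y = Wx + (α/r)·BAx around a frozen pre-trained linear layer; it illustrates the method the model applies rather than OmniAvatar’s exact layers.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA: keep the pre-trained weight frozen and learn a
    low-rank update B @ A. A generic sketch of the technique, not
    OmniAvatar's exact fine-tuned layers."""

    def __init__(self, base: nn.Linear, rank=16, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay intact
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the low-rank factors are trained:
layer = LoRALinear(nn.Linear(1280, 1280), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 1280 * 16 = 40,960 vs. ~1.6M in the frozen base
```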
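The frame-overlap strategy can be sketched as chunked generation, where each chunk is conditioned on a fixed reference-image embedding (identity) and on the tail frames of the previous chunk (continuity). The `generate_chunk` callable and its arguments are assumptions for illustration, not the real inference API.

```python
import torch

def generate_long_video(generate_chunk, ref_embedding, audio_feats,
                        chunk_len=32, overlap=8):
    """Hypothetical sketch of chunked long-video generation.
    `generate_chunk` stands in for the diffusion sampler; its
    signature is an assumption, not OmniAvatar's actual API."""
    frames, prev_tail = [], None
    step = chunk_len - overlap
    for start in range(0, audio_feats.shape[0], step):
        audio_chunk = audio_feats[start:start + chunk_len]
        chunk = generate_chunk(
            audio=audio_chunk,
            identity=ref_embedding,   # keeps the character consistent
            prefix_frames=prev_tail,  # overlap with the previous chunk
        )
        # Drop the overlapping prefix except for the very first chunk.
        frames.append(chunk if prev_tail is None else chunk[overlap:])
        prev_tail = chunk[-overlap:]
    return torch.cat(frames, dim=0)
```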
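Finally, a generic DDPM-style sampling loop, shown only to make the “gradual denoising” idea concrete; OmniAvatar’s actual sampler and noise schedule may differ.

```python
import torch

@torch.no_grad()
def sample_video_latent(denoiser, shape, num_steps=50):
    """Generic diffusion sampling loop (DDPM ancestral sampling).
    `denoiser(x, t)` predicts the noise present in x at step t."""
    x = torch.randn(shape)                   # start from pure noise
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t)                 # predicted noise
        # Remove a fraction of the predicted noise (posterior mean).
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / (
            torch.sqrt(alphas[t])
        )
        if t > 0:                            # re-inject noise except at t=0
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                 # clean video latent
```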
Project Resources
- Official Website: https://omni-avatar.github.io/
- GitHub Repository: https://github.com/Omni-Avatar/OmniAvatar
- Hugging Face Model Page: https://huggingface.co/OmniAvatar/OmniAvatar-14B
- arXiv Technical Paper: https://arxiv.org/pdf/2506.18866
Application Scenarios
- Virtual Content Creation: Generates realistic virtual avatars for podcasts and video bloggers, reducing production costs and enriching content formats.
- Interactive Social Platforms: Provides users with personalized avatars capable of natural expressions and motion, enabling immersive virtual interactions.
- Education and Training: Generates virtual teacher avatars that explain educational content from audio input, making learning more engaging and interactive.
- Advertising and Marketing: Creates customized virtual brand ambassadors that can be tailored to match brand identity and perform specific actions for targeted promotional campaigns.
- Gaming and Virtual Reality: Quickly generates expressive virtual characters with lifelike movements and emotions, enhancing immersion and realism in games and VR environments.