OmniAvatar – An audio-driven full-body video generation model jointly developed by Zhejiang University and Alibaba
What is OmniAvatar?
OmniAvatar is an audio-driven full-body video generation model jointly developed by Zhejiang University and Alibaba Group. Given an audio input and optional text prompts, it generates natural, realistic full-body animated videos in which the character’s expressions and motions stay synchronized with the speech. Powered by pixel-level multi-stage audio embedding and LoRA fine-tuning techniques, OmniAvatar significantly improves lip-sync accuracy and the naturalness of body movements. It also supports interactions between the character and objects, background control, and emotion manipulation, making it suitable for a wide range of applications including podcasts, interactive videos, and virtual environments.
Key Features of OmniAvatar
- Natural Lip Synchronization: Generates lip movements that are precisely synchronized with the input audio, maintaining high accuracy even in complex scenarios.
- Full-Body Animation Generation: Produces smooth, realistic body motions, resulting in lifelike and engaging animated characters.
- Text-Based Control: Supports precise control over video content via text prompts, allowing customization of character movements, background settings, emotional states, and more.
- Human-Object Interaction: Capable of generating scenes where the character interacts with surrounding objects, such as picking up items or operating devices, expanding its practical use cases.
- Background Control: Allows dynamic background changes based on textual instructions to fit various scene requirements.
- Emotion Control: Characters can express different emotions such as happiness, sadness, or anger, guided by text prompts, enhancing expressiveness and realism.
Technical Foundations of OmniAvatar
- Pixel-Level Multi-Stage Audio Embedding: Maps audio features into the model’s latent space at the pixel level, so the audio signal directly influences body-movement generation, improving both lip-sync precision and full-body motion naturalness (see the first sketch after this list).
- LoRA (Low-Rank Adaptation) Fine-Tuning: Applies LoRA to fine-tune pre-trained models efficiently. It introduces low-rank decompositions into the weight matrices to reduce the number of trainable parameters while preserving the model’s original capabilities, improving both training efficiency and output quality (second sketch below).
- Long-Form Video Generation Strategy: To support long videos, OmniAvatar uses reference-image embeddings to maintain character identity and frame-overlap strategies to ensure temporal continuity and avoid abrupt motion transitions (third sketch below).
- Diffusion-Based Video Generation: Uses diffusion models as the core framework, gradually denoising random latents into high-quality video frames; these models excel at producing long, coherent video sequences (fourth sketch below).
- Transformer Architecture Integration: Incorporates Transformers within the diffusion framework to better capture long-range dependencies and maintain semantic coherence across frames, further enhancing the quality and consistency of the generated videos.
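To make the pixel-level audio-embedding idea concrete, here is a minimal PyTorch-style sketch in which per-frame audio features (e.g., from a speech encoder) are projected and broadcast over every spatial position of the video latent at several network stages. All class and parameter names are hypothetical illustrations, not OmniAvatar’s actual code.

```python
import torch
import torch.nn as nn

class MultiStageAudioEmbedding(nn.Module):
    """Hypothetical sketch: inject per-frame audio features into a video
    latent at several stages so that audio conditions every spatial
    position ("pixel level"). Not the official OmniAvatar implementation."""

    def __init__(self, audio_dim=768, latent_channels=(320, 640, 1280)):
        super().__init__()
        # One projection per stage of the denoising network.
        self.projections = nn.ModuleList(
            nn.Linear(audio_dim, c) for c in latent_channels
        )

    def forward(self, latent, audio_feats, stage):
        # latent:      (B, C, T, H, W) video latent at this stage
        # audio_feats: (B, T, audio_dim), one feature vector per latent frame
        b, c, t, h, w = latent.shape
        emb = self.projections[stage](audio_feats)   # (B, T, C)
        emb = emb.permute(0, 2, 1)[..., None, None]  # (B, C, T, 1, 1)
        # Broadcast the audio embedding over all spatial positions
        # and add it to the latent, so audio influences every pixel.
        return latent + emb.expand(b, c, t, h, w)
```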
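LoRA itself is a general technique, so the following sketch shows the standard low-rank update y = Wx + (α/r)·BAx around a frozen pre-trained linear layer; it illustrates the method the model applies rather than OmniAvatar’s exact layers.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA: keep the pre-trained weight frozen and learn a
    low-rank update B @ A. A generic sketch of the technique, not
    OmniAvatar's exact fine-tuned layers."""

    def __init__(self, base: nn.Linear, rank=16, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay intact
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the low-rank factors are trained:
layer = LoRALinear(nn.Linear(1280, 1280), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 1280 * 16 = 40,960 vs. ~1.6M in the frozen base
```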
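The frame-overlap strategy can be sketched as chunked generation, where each chunk is conditioned on a fixed reference-image embedding (identity) and on the tail frames of the previous chunk (continuity). The `generate_chunk` callable and its arguments are assumptions for illustration, not the real inference API.

```python
import torch

def generate_long_video(generate_chunk, ref_embedding, audio_feats,
                        chunk_len=32, overlap=8):
    """Hypothetical sketch of chunked long-video generation.
    `generate_chunk` stands in for the diffusion sampler; its
    signature is an assumption, not OmniAvatar's actual API."""
    frames, prev_tail = [], None
    step = chunk_len - overlap
    for start in range(0, audio_feats.shape[0], step):
        audio_chunk = audio_feats[start:start + chunk_len]
        chunk = generate_chunk(
            audio=audio_chunk,
            identity=ref_embedding,   # keeps the character consistent
            prefix_frames=prev_tail,  # overlap with the previous chunk
        )
        # Drop the overlapping prefix except for the very first chunk.
        frames.append(chunk if prev_tail is None else chunk[overlap:])
        prev_tail = chunk[-overlap:]
    return torch.cat(frames, dim=0)
```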
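Finally, a generic DDPM-style sampling loop, shown only to make the “gradual denoising” idea concrete; OmniAvatar’s actual sampler and noise schedule may differ.

```python
import torch

@torch.no_grad()
def sample_video_latent(denoiser, shape, num_steps=50):
    """Generic diffusion sampling loop (DDPM ancestral sampling).
    `denoiser(x, t)` predicts the noise present in x at step t."""
    x = torch.randn(shape)                   # start from pure noise
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t)                 # predicted noise
        # Remove a fraction of the predicted noise (posterior mean).
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / (
            torch.sqrt(alphas[t])
        )
        if t > 0:                            # re-inject noise except at t=0
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                 # clean video latent
```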
Project Resources
- Official Website: https://omni-avatar.github.io/
- GitHub Repository: https://github.com/Omni-Avatar/OmniAvatar
- Hugging Face Model Page: https://huggingface.co/OmniAvatar/OmniAvatar-14B
- arXiv Technical Paper: https://arxiv.org/pdf/2506.18866
Application Scenarios
- Virtual Content Creation: Generates realistic virtual avatars for podcasts and video bloggers, reducing production costs and enriching content formats.
- Interactive Social Platforms: Provides users with personalized avatars capable of natural expressions and motion, enabling immersive virtual interactions.
- Education and Training: Generates virtual teacher avatars that explain educational content from audio input, making learning more engaging and interactive.
- Advertising and Marketing: Creates customized virtual brand ambassadors that can be tailored to match brand identity and perform specific actions for targeted promotional campaigns.
- Gaming and Virtual Reality: Quickly generates expressive virtual characters with lifelike movements and emotions, enhancing immersion and realism in games and VR environments.