DreamVVT – A video-based virtual try-on technology developed by ByteDance in collaboration with Tsinghua University

AI Tools · updated 4d ago · dongdong

What is DreamVVT?

DreamVVT is a Video Virtual Try-On (VVT) technology jointly developed by ByteDance and Tsinghua University (Shenzhen). Built on the Diffusion Transformers (DiTs) framework, it uses a two-stage approach to produce high-fidelity, temporally coherent virtual try-on results.

In the first stage, key frames are sampled from the input video and combined with a Vision-Language Model (VLM) to generate semantically consistent try-on images. In the second stage, skeleton maps and motion information are fed to a pre-trained video generation model to ensure temporal coherence. DreamVVT preserves clothing details even through complex movements and scenes, supports full-outfit try-on, and can even dress cartoon characters in real clothing.
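The two-stage flow described above can be sketched in Python. This is an illustrative mock based only on the description here, not DreamVVT's actual code: the function names, the uniform keyframe-sampling strategy, and the nearest-keyframe lookup in Stage 2 are all assumptions standing in for the real models.

```python
def sample_keyframes(num_frames, num_keyframes):
    """Pick evenly spaced frame indices to serve as Stage 1 keyframes."""
    if num_keyframes >= num_frames:
        return list(range(num_frames))
    stride = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * stride) for i in range(num_keyframes)]

def stage1_tryon_images(video_frames, garment, keyframe_idx):
    """Stage 1: generate semantically consistent try-on images for keyframes.
    Stand-in: tag each keyframe with the garment id."""
    return {i: f"tryon({video_frames[i]}, {garment})" for i in keyframe_idx}

def stage2_generate_video(keyframe_images, skeleton_maps):
    """Stage 2: propagate keyframe appearance across all frames, guided by
    per-frame skeleton/motion cues (stand-in: nearest-keyframe lookup)."""
    keys = sorted(keyframe_images)
    out = []
    for i, pose in enumerate(skeleton_maps):
        nearest = min(keys, key=lambda k: abs(k - i))
        out.append((keyframe_images[nearest], pose))
    return out

frames = [f"frame{i}" for i in range(16)]
poses = [f"pose{i}" for i in range(16)]   # skeleton map per input frame
kf = sample_keyframes(len(frames), 4)     # [0, 5, 10, 15]
images = stage1_tryon_images(frames, "dress_01", kf)
video = stage2_generate_video(images, poses)
print(len(video))  # 16 output frames, one per input frame
```

The point of the sketch is the division of labor: Stage 1 only has to get a handful of frames right, while Stage 2's job is purely temporal, spreading that appearance across every frame under motion guidance.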



Main Features of DreamVVT

  • High-Fidelity Virtual Try-On: Produces high-quality clothing try-on effects in videos, preserving details and textures even during complex movements and in challenging scenes.

  • Temporal Coherence: Ensures smooth and natural transitions between frames through a two-stage process, avoiding abrupt changes.

  • Multi-Scene Adaptability: Works across diverse scenes and actions, including complex interactions, dynamic backgrounds, and varying lighting conditions.

  • Unpaired Data Training: Trains on unpaired human data, reducing data preparation difficulty and cost, and improving model generalization.

  • Full-Outfit Try-On: Supports both single-item and full-outfit try-on for a more complete virtual dressing experience.

  • Cross-Domain Applications: Can dress cartoon characters in real-world clothing, extending its use beyond conventional fashion.

  • Dynamic Effects Support: Generates try-on videos with realistic motion effects, such as fabric fluttering and wrinkle changes.


Technical Principles of DreamVVT

  • Two-Stage Processing Framework:

    • Stage 1: Generate high-fidelity try-on images for key frames.

    • Stage 2: Use these key frames to create a coherent try-on video.

  • Diffusion Transformers (DiTs): Combines the DiT architecture with a VLM to achieve high-quality image generation and semantic consistency.

  • Key Frame Sampling and Generation: Samples representative frames from the input video and uses a multi-frame try-on model to create semantically consistent, high-fidelity images.

  • Skeleton Map and Motion Extraction: Extracts skeleton maps and motion information from the input video to guide the dynamic changes during video generation.

  • Pre-Trained Video Model Adaptation: Uses a LoRA adapter to enhance a pre-trained video generation model, combining key-frame try-on images with motion data to produce temporally coherent try-on videos.
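The LoRA adaptation in the last point can be sketched with NumPy. This follows the standard LoRA formulation (a frozen weight plus a low-rank update scaled by alpha/rank); the shapes and initialization are illustrative, not DreamVVT's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4.0

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection (zero init)

def lora_forward(x):
    """y = W x + (alpha / rank) * B (A x): base model plus low-rank update."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapter starts as an exact no-op,
# so the pre-trained video model's behavior is preserved at step 0.
assert np.allclose(lora_forward(x), W @ x)
```

This is why LoRA suits the second stage: only the small A and B matrices are trained, so the pre-trained video model's motion priors stay intact while the adapter learns to fuse the keyframe try-on images with the skeleton guidance.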



Application Scenarios for DreamVVT

  • Online Shopping Platforms: Enables virtual try-on features for e-commerce, allowing consumers to upload their photos or videos to see different styles and colors in real time, improving shopping experience and reducing return rates.

  • Virtual Fashion Shows: Helps fashion designers showcase their work virtually, breaking the limits of physical venues and schedules, and attracting more viewers.

  • Entertainment & Film Production: Speeds up costume changes for characters in film and TV, reducing production costs and enabling animated characters to wear real clothing for better visuals.

  • Virtual Character Customization: In gaming and VR, allows personalized clothing customization for virtual characters, enhancing user engagement and identification.

  • Social Media & Content Creation: Lets users share fashion looks with virtual try-on on social platforms, and helps creators produce engaging content to attract more followers.
