What is Waver 1.0?
Waver 1.0 is a video generation model from ByteDance, built on a rectified flow Transformer architecture. It handles text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single unified framework, so no switching between models is needed. It supports resolutions of up to 1080p and flexible video lengths ranging from 2 to 10 seconds. The model is strong at capturing complex motion, producing videos with large motion amplitude and good temporal consistency. On Waver-Bench 1.0 and the Hermes motion benchmark, Waver 1.0 is reported to outperform existing open-source and closed-source models. It also supports multiple artistic styles, including photorealism, animation, clay, plush, and more.

Key Features of Waver 1.0
- Unified Generation: Supports T2V, I2V, and T2I generation within a single framework, with no model switching required.
- High Resolution & Flexible Length: Outputs up to 1080p with adjustable resolutions, aspect ratios, and durations of 2–10 seconds.
- Complex Motion Modeling: Captures intricate motion with high motion amplitude and temporal consistency.
- Multi-Shot Storytelling: Produces coherent multi-shot narrative videos with consistent themes, visual style, and atmosphere.
- Artistic Style Support: Generates videos in diverse artistic styles such as ultra-realism, animation, clay, and plush.
- Performance Advantage: Reported to outperform existing open- and closed-source models on the Waver-Bench 1.0 and Hermes benchmarks.
- Inference Optimization: Uses APG (Adaptive Projected Guidance) to reduce artifacts and enhance realism.
- Training Strategy: Begins training on low-resolution videos, then gradually increases resolution to optimize motion modeling.
- Prompt Tagging: Labels prompts to distinguish data types, improving generation quality and accuracy.

Technical Principles of Waver 1.0
- Model Architecture: Waver 1.0 adopts a Hybrid Stream DiT (Diffusion Transformer) architecture. It uses Wan-VAE to compress videos into latent variables and flan-t5-xxl together with Qwen2.5-32B-Instruct to extract text features. The video and text modalities are fused with a combined dual-stream and single-stream design.
- 1080p Generation: The Waver-Refiner, which is also based on DiT, is trained with flow matching. A low-resolution video (480p or 720p) is first upsampled to 1080p and then noised; the refiner takes this noised input and produces a high-quality 1080p output. The refiner uses window attention and fewer inference steps, which significantly improves speed.
- Training Methods: Motion learning begins with 192p videos, where a large share of the compute is spent, then scales up progressively to 480p and 720p. Following SD3's flow-matching setup, the sigma shift value is gradually increased during higher-resolution training.
- Prompt Tagging: Different types of training data are distinguished with style and quality labels. During inference, descriptions of unwanted qualities (e.g., "low resolution" or "slow motion") are placed in the negative prompt to suppress poor-quality outputs.
- Inference Optimization (APG): APG (Adaptive Projected Guidance) decomposes the CFG (Classifier-Free Guidance) update into parallel and orthogonal components and reduces the weight of the parallel component. This prevents oversaturation and reduces artifacts while enhancing realism.
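
The refiner input described above (upsample, then add noise) can be sketched in a few lines. This is only an illustration under stated assumptions: the nearest-neighbor upsampling, the integer scale factor, the noise level `t`, and the helper names `upsample_nearest` and `add_flow_noise` are all hypothetical, and the sketch works on raw arrays rather than on Wan-VAE latents. The noising rule is the standard rectified-flow interpolation between data and Gaussian noise.

```python
import numpy as np

def upsample_nearest(video, factor):
    """Nearest-neighbor upsampling of a (frames, height, width, channels) array."""
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def add_flow_noise(x, t, rng):
    """Rectified-flow interpolation: t=0 returns x, t=1 returns pure noise."""
    noise = rng.standard_normal(x.shape)
    return (1.0 - t) * x + t * noise

rng = np.random.default_rng(0)
low_res = rng.random((4, 8, 8, 3))        # toy stand-in for a 480p/720p clip
upsampled = upsample_nearest(low_res, 2)  # toy stand-in for the 1080p upsample
noisy = add_flow_noise(upsampled, 0.7, rng)  # t=0.7 is an arbitrary choice
```

A refiner model would then be trained to map `noisy` back to the true high-resolution clip.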
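
To make the sigma shift concrete, here is a minimal sketch of the timestep shift used in SD3-style flow matching, where a larger shift moves the schedule toward higher noise levels. The per-stage shift values below are hypothetical and chosen only to show the trend; the source does not state the values Waver 1.0 uses.

```python
import numpy as np

def shift_sigma(sigma, shift):
    """SD3-style shift: shift=1 leaves sigma unchanged, larger values push it toward 1."""
    sigma = np.asarray(sigma, dtype=float)
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

# Hypothetical per-stage shift values (illustrative only, not from the paper).
STAGE_SHIFT = {"192p": 1.0, "480p": 3.0, "720p": 5.0}

base = np.linspace(0.0, 1.0, 5)  # unshifted noise levels in [0, 1]
schedules = {stage: shift_sigma(base, s) for stage, s in STAGE_SHIFT.items()}
```

The endpoints 0 and 1 stay fixed under the shift, so only the interior of the schedule moves toward noisier levels as resolution grows.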
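
The prompt tagging idea can be illustrated with a small helper. The tag format, the function names `tag_caption` and `build_negative_prompt`, and the placement of the labels around the caption are assumptions made for this sketch; only the two example negative phrases come from the description above.

```python
def tag_caption(caption, style=None, quality=None):
    """Attach optional style and quality labels to a training caption."""
    parts = []
    if style:
        parts.append(f"[style: {style}]")      # hypothetical tag format
    parts.append(caption)
    if quality:
        parts.append(f"[quality: {quality}]")  # hypothetical tag format
    return " ".join(parts)

# Phrases describing unwanted qualities, used as the negative prompt at inference.
DEFAULT_NEGATIVE = ("low resolution", "slow motion")

def build_negative_prompt(extra=()):
    """Join the default negative phrases with any caller-supplied ones."""
    return ", ".join(DEFAULT_NEGATIVE + tuple(extra))

tagged = tag_caption("a cat runs across a field", style="clay", quality="high")
negative = build_negative_prompt()
```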
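
The APG decomposition can be sketched as follows. This is a simplified illustration, not Waver's implementation: it projects the CFG update onto the conditional prediction and down-weights the parallel part with a factor `eta`, and it leaves out the norm rescaling and momentum terms that the published APG method also includes. The function name and default values are hypothetical.

```python
import numpy as np

def apg_guidance(cond, uncond, scale=7.5, eta=0.0):
    """CFG with the parallel part of the update scaled by eta (eta=1 is plain CFG)."""
    diff = cond - uncond                          # standard CFG update direction
    flat_c = cond.reshape(-1)
    flat_d = diff.reshape(-1)
    # Coefficient of the projection of diff onto the conditional prediction.
    coef = (flat_d @ flat_c) / (flat_c @ flat_c + 1e-12)
    parallel = coef * cond                        # component along cond
    orthogonal = diff - parallel                  # remainder, orthogonal to cond
    update = orthogonal + eta * parallel          # down-weight the parallel part
    return cond + (scale - 1.0) * update

cond = np.array([1.0, 0.0])
uncond = np.array([0.0, -1.0])
guided = apg_guidance(cond, uncond, scale=3.0, eta=0.0)
plain_cfg = apg_guidance(cond, uncond, scale=3.0, eta=1.0)
```

With `eta=0` the guidance step no longer pushes along the conditional prediction itself, which is the component associated with oversaturation; with `eta=1` the function reduces to ordinary CFG.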

Waver 1.0 Project Links
- Official Website: http://www.waver.video/
- GitHub Repository: https://github.com/FoundationVision/Waver
- arXiv Paper: https://arxiv.org/pdf/2508.15761

Application Scenarios of Waver 1.0
- Content Creation: Transform text into vivid videos for storytelling, advertising, or short films.
- Product Showcase: Convert product images into dynamic display videos for e-commerce, live streaming, or virtual try-on.
- Education & Training: Turn teaching content or training documents into interactive videos, enhancing learning experiences.
- Social Media: Rapidly generate engaging, shareable video content to attract user attention.
- Animation Production: Convert static images into animations, suitable for character-driven stories and special effects.
- Game Development: Generate dynamic scenes and character animations to enrich immersive gameplay experiences.