SkyReels-V2: Kunlun Wanwei’s Open-Source Infinite-Length Film Generative Model
What is SkyReels-V2
SkyReels-V2 is an infinite-length film generative model developed by Kunlun Wanwei’s SkyReels team. Built on the Diffusion Forcing framework, it combines multi-modal large language models (MLLMs), multi-stage pretraining, and reinforcement learning to generate high-quality video content of, in principle, unbounded length. SkyReels-V2 tackles long-standing challenges in prompt adherence, visual quality, motion dynamics, and video duration. It supports a range of applications, including story generation, image-to-video synthesis, camera directing, and multi-subject consistent video generation. The model and related code have been open-sourced, providing a powerful tool for creative content production and virtual simulation.
Key Features of SkyReels-V2
- Infinite-Length Video Generation: Supports theoretically unbounded video length, breaking the duration limits of traditional video generation models.
- Story Generation: Arranges complex multi-action sequences from narrative text prompts to achieve dynamic storytelling.
- Image-to-Video Synthesis: Offers two ways to turn static images into coherent videos: fine-tuning a full-sequence text-to-video diffusion model (SkyReels-V2-I2V), or conditioning the Diffusion Forcing model on input frames (SkyReels-V2-DF).
- Camera Directing: Generates smooth and diverse camera motion, enhancing the cinematic feel of videos.
- Elements-to-Video Generation: Composes arbitrary visual elements (such as characters, objects, and backgrounds) into coherent videos guided by text prompts, suited to short dramas, music videos, and virtual e-commerce content creation.
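Infinite-length generation of this kind is typically realized by rolling a fixed-size window forward, conditioning each new window on the tail of the previous one. A minimal conceptual sketch (the helper names here are illustrative assumptions, not the official SkyReels-V2 API):

```python
def generate_unbounded(model_step, total_frames, window=8, overlap=4):
    """Conceptual sliding-window rollout for unbounded video length.

    `model_step(condition, n_new)` stands in for one denoising rollout
    of the video model (hypothetical name, not the SkyReels-V2 API).
    The last `overlap` frames of each window condition the next one,
    keeping the sequence temporally coherent across window boundaries.
    """
    video = model_step(condition=None, n_new=window)   # first window
    while len(video) < total_frames:
        context = video[-overlap:]                     # tail frames as context
        video += model_step(condition=context, n_new=window - overlap)
    return video[:total_frames]

# Toy stand-in "model": each new frame is just the next integer, so the
# coherence of the rollout is easy to check by eye.
def dummy_step(condition, n_new):
    start = 0 if condition is None else condition[-1] + 1
    return list(range(start, start + n_new))

print(generate_unbounded(dummy_step, total_frames=20))
```

Because only the `overlap` tail frames are carried forward, memory stays constant no matter how long the output grows, which is what makes the "infinite" length practical.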
Technical Principles of SkyReels-V2
- Multi-modal Large Language Model (MLLM): Uses an MLLM to draft initial video descriptions, supplemented by sub-expert models (for shot type, angle, position, expressions, and camera motion) that add detailed shot-language descriptions. Human annotation and further model training deepen the understanding of cinematic grammar, significantly improving prompt adherence in generated videos.
- Multi-Stage Training:
  - Progressive Resolution Pretraining: Gradually raises resolution from low (256p) to high (720p) to strengthen the model’s generative capabilities.
  - Multi-Stage Post-Training Optimization: Combines initial concept-balanced supervised fine-tuning (SFT), motion-specific reinforcement learning (RL) training, Diffusion Forcing (DF) training, and a final high-quality SFT pass to ensure strong performance across all aspects.
- Reinforcement Learning (RL): Optimizes motion quality to address shortcomings in motion dynamics, smoothness, and physical plausibility. A semi-automated data collection pipeline generates preference comparison pairs, which are used to train a reward model and perform Direct Preference Optimization (DPO) on the generator.
- Diffusion Forcing Framework: Assigns an independent noise level to each frame, enabling unbounded video generation. A non-decreasing noise schedule shrinks the search space of per-frame denoising schedules from O(1e48) to O(1e32), significantly improving generation efficiency.
- Efficient Data Processing and Optimization: Integrates general datasets, self-collected media, and art resource libraries, applying multi-stage filtering and annotation to ensure training-data quality. Techniques such as FP8 quantization, multi-GPU parallelism, and model distillation substantially cut inference time and computational cost, improving the model’s practicality.
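The core Diffusion Forcing idea, assigning each frame its own noise level under a non-decreasing schedule, can be sketched as follows. The linear ramp and the toy tensor sizes are illustrative assumptions, not the exact schedule used by SkyReels-V2:

```python
import numpy as np

def make_noise_levels(num_frames, num_steps):
    """Assign each frame a noise level, non-decreasing from the earliest
    frame to the latest (illustrative linear ramp; the actual SkyReels-V2
    schedule may differ). Monotonicity is what shrinks the space of valid
    per-frame denoising schedules."""
    levels = np.round(np.linspace(0, num_steps - 1, num_frames)).astype(int)
    assert np.all(np.diff(levels) >= 0)  # non-decreasing across frames
    return levels

def add_framewise_noise(frames, levels, alphas_cumprod, rng):
    """Noise each frame independently at its assigned level, as in
    Diffusion Forcing training (standard DDPM-style forward process)."""
    a = alphas_cumprod[levels].reshape(-1, 1, 1, 1)  # (F, 1, 1, 1)
    noise = rng.standard_normal(frames.shape)
    return np.sqrt(a) * frames + np.sqrt(1 - a) * noise

# Toy usage: 8 frames of 3x16x16 "video", 50 diffusion steps.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 3, 16, 16))
alphas_cumprod = np.cumprod(1 - np.linspace(1e-4, 0.02, 50))
levels = make_noise_levels(num_frames=8, num_steps=50)
noisy = add_framewise_noise(frames, levels, alphas_cumprod, rng)
print(levels.tolist(), noisy.shape)
```

At generation time, earlier frames sit at low noise (nearly clean) while later frames sit at high noise, so the model can keep appending noisy frames and denoising them conditioned on the clean past.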
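The DPO step mentioned above optimizes the generator directly on preference pairs rather than through a separate RL loop. A minimal scalar sketch of the DPO loss for one pair (the log-probability values and beta are illustrative, not SkyReels-V2's actual hyperparameters):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are log-probabilities of the chosen/rejected sample under the
    policy being trained (pi_*) and a frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy ranks the pair correctly.
    return math.log1p(math.exp(-margin))

# Toy usage with illustrative log-probabilities.
loss = dpo_loss(pi_chosen=-1.0, pi_rejected=-2.5,
                ref_chosen=-1.2, ref_rejected=-2.0, beta=0.1)
print(round(loss, 4))
```

Minimizing this loss pushes the policy to widen the margin between preferred and dispreferred motion clips while the reference term keeps it from drifting too far from the pretrained model.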
Project Addresses for SkyReels-V2
- GitHub Repository: https://github.com/SkyworkAI/SkyReels-V2
- HuggingFace Model Hub: https://huggingface.co/collections/Skywork/skyreels-v2
- arXiv Technical Paper: https://arxiv.org/abs/2504.13074
Application Scenarios for SkyReels-V2
- Film Production: Generates infinite-length coherent videos for complex storytelling and long-shot creation.
- Advertising Creation: Transforms static images into dynamic videos, enhancing the appeal and expressiveness of advertisements.
- Video Shooting Assistance: Generates smooth camera motion effects to aid in designing and implementing complex shots.
- Short Dramas and Music Videos: Quickly produces high-quality videos, reducing shooting costs and time.
- Virtual Reality and Game Development: Generates realistic virtual scenes and character animations, enhancing user experience and immersion.