SkyReels-A2 – A controllable video generation framework launched by Kunlun Tech

What is SkyReels-A2?

SkyReels-A2 is a controllable video generation framework introduced by Kunlun Tech (Kunlun Wanwei). Given a text prompt, it synthesizes a video by combining arbitrary visual elements (such as characters, objects, and backgrounds) while strictly maintaining consistency with the reference image of each element. A purpose-built data pipeline constructs prompt-reference-video triplets for model training, and the framework introduces a novel image-text joint embedding model. SkyReels-A2 also optimizes the inference pipeline for speed and output stability, and it introduces the A2 Bench benchmark for systematic evaluation.

The Main Functions of SkyReels-A2

Multi-element Combination: Combines arbitrary visual elements (such as characters, objects, and backgrounds) into a single synthesized video while strictly maintaining consistency with the reference image of each element.
Text-driven Generation: Generates videos from text prompts, so users can control the content and style of the video through detailed textual descriptions.
High-quality Video Output: Produces high-resolution, high-quality videos suitable for a range of application scenarios.
Real-time Interaction: Supports interaction during the generation process, so users can adjust generation parameters until the result better meets their requirements.

The Technical Principles of SkyReels-A2

Diffusion Model: SkyReels-A2 is built on a diffusion model. Starting from random noise, it applies a learned denoising process step by step until the target video emerges, with text and image prompts guiding each step.
Image-Text Joint Embedding Model: SkyReels-A2 introduces a novel image-text joint embedding model that embeds reference images and text prompts into a shared feature space. Its dual-branch structure extracts spatial and semantic features from each reference image and injects them into the diffusion model’s generation process. A 3D VAE (Variational Autoencoder) extracts the spatial features, preserving local detail, while the CLIP model extracts the semantic features, maintaining global semantic consistency.
Data Pipeline: A comprehensive data pipeline produces high-quality (text prompt, reference image, video) triplets for training. It includes video preprocessing, keyframe segmentation, multi-expert video captioning, and visual element extraction, so that the resulting training data effectively supports model learning.
Optimized Inference Pipeline: To improve generation speed and stability, SkyReels-A2 optimizes the inference pipeline. It uses the UniPC multi-step scheduler together with parallel processing techniques (Context Parallel, CFG Parallel, and VAE Parallel) to raise inference efficiency. Model quantization and parameter-level offloading reduce GPU memory consumption, enabling the model to run on consumer-grade GPUs.
Evaluation Benchmark A2 Bench: SkyReels-A2 introduces the A2 Bench benchmark to systematically evaluate performance on Element-to-Video (E2V) tasks. A2 Bench assesses the model along multiple dimensions, such as compositional consistency, visual quality, and text alignment, so that its performance can be judged against practical requirements across various scenarios.
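
As a rough illustration of the step-by-step denoising described under Diffusion Model, the following Python sketch runs a toy iterative denoising loop with classifier-free guidance (the "CFG" named in the inference-pipeline item). It is not the SkyReels-A2 sampler. The names `toy_noise_predictor` and `cfg_denoise`, the fixed step size, and the scalar "condition" are all hypothetical stand-ins; a real pipeline would use a trained network and a scheduler such as UniPC instead of this fixed-step update.

```python
import random


def toy_noise_predictor(x, cond):
    """Hypothetical stand-in for a learned denoiser.

    Returns the offset of each value from a target: the condition value
    for the conditional branch, or 0.0 when cond is None (unconditional).
    """
    target = cond if cond is not None else 0.0
    return [xi - target for xi in x]


def cfg_denoise(cond, steps=50, guidance_scale=1.0, step_size=0.2, seed=0, dim=4):
    """Toy iterative denoising with classifier-free guidance (CFG)."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        eps_cond = toy_noise_predictor(x, cond)    # prompt-conditioned pass
        eps_uncond = toy_noise_predictor(x, None)  # unconditional pass
        # CFG: move from the unconditional prediction toward the
        # conditional one, scaled by the guidance weight.
        eps = [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
        # Remove a fraction of the predicted noise at each step.
        x = [xi - step_size * e for xi, e in zip(x, eps)]
    return x


if __name__ == "__main__":
    print(cfg_denoise(cond=3.0, guidance_scale=1.0))
    print(cfg_denoise(cond=3.0, guidance_scale=2.0))
```

In this toy setting, a guidance scale of 1.0 lets the sample settle on the conditional target, while larger values push it further in the direction of the condition. Each step needs both a conditional and an unconditional prediction, which is presumably the pair of passes that a technique like CFG Parallel runs side by side.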

Project Addresses of SkyReels-A2

Project Website: https://skyworkai.github.io/skyreels-a2.github.io/
GitHub Repository: https://github.com/SkyworkAI/SkyReels-A2
Hugging Face Model Hub: https://huggingface.co/Skywork/SkyReels-A2
arXiv Technical Paper: https://arxiv.org/pdf/2504.02436

Application Scenarios of SkyReels-A2

Drama and Film & TV Production: Quickly generate virtual scene and character videos to reduce shooting costs.
Virtual E-commerce: Generate product display and virtual try-on videos to enhance the shopping experience.
Music Video Creation: Generate creative videos based on music content without complex shooting.
Advertising and Marketing: Generate personalized advertisements and brand promotion videos to increase audience appeal.
Education and Training: Generate virtual teaching scenarios and skill demonstration videos to improve teaching effectiveness.