What is InfinityStar?
InfinityStar is an efficient video generation model developed by ByteDance. It adopts a unified spatiotemporal autoregressive framework to enable fast synthesis of high-resolution images and dynamic videos. The model uses a spatiotemporal pyramid structure that decomposes videos into sequential segments, effectively decoupling appearance and motion information to enhance generation efficiency. Built on a pretrained Variational Autoencoder (VAE) and equipped with a knowledge inheritance strategy, InfinityStar significantly shortens training time and reduces computational cost. It supports a wide range of generation tasks, including text-to-image, text-to-video, image-to-video, and long interactive video generation.

Key Features
High-resolution video generation:
Capable of producing high-quality 720p videos and synthesizing complex dynamic scenes efficiently.
Multi-task support:
Covers multiple generation tasks such as text-to-image, text-to-video, image-to-video, and interactive video generation, meeting diverse user needs.
High generation efficiency:
Generates a 5-second 720p video in just 58 seconds, significantly faster than traditional diffusion models.
Unified spatiotemporal modeling:
The spatiotemporal pyramid structure effectively decouples appearance and motion information, allowing the model to capture spatial and temporal dependencies efficiently.
Knowledge inheritance strategy:
Built on a pretrained VAE, which greatly reduces training time and computational requirements.
Open-source and developer-friendly:
All code and models are open-sourced, enabling researchers and developers to adopt, adapt, and extend the system easily.
Technical Principles of InfinityStar
Unified spatiotemporal modeling:
Uses a fully discrete approach that decomposes videos into sequential segments. A spatiotemporal pyramid model jointly captures spatial and temporal dependencies, effectively separating appearance features from motion dynamics.
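To make the idea concrete, here is a minimal, hypothetical sketch of pyramid-style decoding: each temporal segment is predicted coarse-to-fine across spatial scales, and every decoded token map is appended to the context for the next step. The names (dummy_transformer, the scale schedule, the tensor shapes) are illustrative placeholders, not InfinityStar's actual interface.

```python
import torch

def dummy_transformer(context, target_shape):
    # Stand-in for the real autoregressive transformer: returns a random
    # token map of shape (batch, h, w, dim) at the requested resolution.
    b, _, d = context.shape
    h, w = target_shape
    return torch.randn(b, h, w, d)

def generate_video_tokens(transformer, text_emb, num_segments=3,
                          scales=((4, 4), (8, 8), (16, 16))):
    # Decode segment by segment (temporal axis); within each segment,
    # decode token maps coarse-to-fine (spatial pyramid). Everything
    # decoded so far is fed back as context for the next step.
    history = [text_emb]
    video = []
    for _ in range(num_segments):
        pyramid = []
        for h, w in scales:
            context = torch.cat(history, dim=1)
            tokens = transformer(context, target_shape=(h, w))
            pyramid.append(tokens)
            history.append(tokens.flatten(1, 2))
        video.append(pyramid)
    return video

text_emb = torch.randn(1, 77, 64)  # placeholder text conditioning
segments = generate_video_tokens(dummy_transformer, text_emb)
print(len(segments), [t.shape for t in segments[0]])
```

The nested loop is the point of the sketch: coarse early scales mostly carry appearance, while later scales and subsequent segments add detail and motion, which is how the pyramid decouples the two.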
Efficient learning strategy:
Built upon a pretrained VAE and enhanced with a knowledge inheritance strategy, the model achieves significantly reduced training time and lower computational overhead.
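The paper's knowledge inheritance procedure is more involved, but the core pattern of warm-starting a new model from pretrained weights can be sketched generically. The inherit_weights helper below is a hypothetical illustration, not code from the repository.

```python
import torch.nn as nn

def inherit_weights(model, pretrained_state_dict):
    # Copy every tensor whose name and shape match the pretrained
    # checkpoint; parameters without a match keep their fresh init.
    own = model.state_dict()
    matched = {k: v for k, v in pretrained_state_dict.items()
               if k in own and v.shape == own[k].shape}
    own.update(matched)
    model.load_state_dict(own)
    return sorted(matched)  # names that were actually inherited

# Toy usage: warm-start one small network from another.
pretrained = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
student = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
print(inherit_weights(student, pretrained.state_dict()))
```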
Multi-task architecture:
Naturally supports tasks such as text-to-image, text-to-video, and image-to-video within a unified framework, ensuring efficient task transfer and flexible usage.
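As a rough illustration of how a single framework can serve several tasks, the hypothetical plan_task helper below switches only the conditioning sequence and the number of temporal segments; the task names, segment counts, and shapes are assumptions for the sketch.

```python
import torch

def plan_task(task, text_emb, image_tokens=None):
    # Hypothetical dispatch: one model handles all tasks, and only the
    # conditioning sequence and requested segment count change.
    if task == "t2i":
        return text_emb, 1                 # one "segment" = a still image
    if task == "t2v":
        return text_emb, 4                 # several temporal segments
    if task == "i2v":
        if image_tokens is None:
            raise ValueError("i2v needs an encoded first frame")
        return torch.cat([text_emb, image_tokens], dim=1), 4
    raise ValueError(f"unknown task: {task}")

text_emb = torch.randn(1, 77, 64)
frame = torch.randn(1, 256, 64)            # placeholder first-frame tokens
cond, n = plan_task("i2v", text_emb, frame)
print(cond.shape, n)                       # torch.Size([1, 333, 64]) 4
```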
Fast generation performance:
Through architectural optimizations, InfinityStar achieves rapid video generation, up to 10× faster than traditional diffusion models when producing 5-second 720p videos.
High-quality outputs:
Demonstrates outstanding performance on the VBench benchmark, producing visually rich, detailed videos and images suitable for a wide range of applications.
Project Links
- GitHub repository: https://github.com/FoundationVision/InfinityStar
- HuggingFace model hub: https://huggingface.co/FoundationVision/InfinityStar
- arXiv technical paper: https://arxiv.org/pdf/2511.04675
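To experiment locally, the released weights can be fetched with the standard huggingface_hub API. Only the repo id below comes from the links above; exact file layout and inference scripts are documented in the GitHub repository and may differ from what this snippet assumes.

```python
# Download the released weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="FoundationVision/InfinityStar")
print(local_dir)  # local directory containing the downloaded model files
```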
Application Scenarios of InfinityStar
Video creation and editing:
Quickly generates high-quality video content for advertising, film special effects, short-video production, and more, significantly enhancing creative efficiency.
Interactive media:
Supports interactive video generation for applications such as interactive games, virtual reality (VR), and augmented reality (AR), improving user engagement.
Personalized content:
Generates customized videos based on user-provided text or images, enabling personalized content recommendation and tailored services.
Animation production:
Produces smooth animation videos at lower cost and shorter production cycles, suitable for animated films, commercials, and other animation-related domains.
Education and training:
Creates dynamic instructional videos, generating animations or visuals related to teaching materials to improve learning effectiveness and engagement.
Social media:
Provides rich and appealing video content for social platforms, allowing users to quickly generate engaging videos and enhance content reach and interaction.