LinGen – A text-to-video generation framework developed jointly by Meta and Princeton University

What is LinGen?

LinGen is a novel text-to-video generation framework developed by Princeton University in collaboration with Meta. It replaces the traditional quadratic-complexity self-attention modules in Diffusion Transformers with a linear-complexity MATE module (comprising the MA-branch and TE-branch), enabling efficient generation of high-resolution, minute-long videos on a single GPU. LinGen significantly reduces computational costs while maintaining high video quality. It outperforms existing state-of-the-art models in both video quality and generation efficiency, paving the way for long-form video generation and real-time interactive video applications.

Key Features of LinGen

High-Resolution Video Generation: Supports generation of high-resolution videos (e.g., 512p, 1024p), meeting the demands of high-quality content creation.
Long-Duration Video Generation: Produces minute-length videos, going beyond the short (10–20 second) clips typical of earlier models.
Linear Computational Complexity: Utilizes the linear-complexity MATE module, greatly reducing computational costs and enabling efficient video generation on a single GPU.
High-Quality Output: Delivers video output with strong visual quality and accurate text alignment, while maintaining frame-to-frame consistency.
Real-Time Interactive Generation: Lower compute cost makes real-time, interactive video generation and editing a realistic goal, which suits a wide range of dynamic content creation scenarios.

Technical Principles of LinGen

MA-branch (Mamba Branch):
- Bidirectional Mamba2 Module: A highly efficient sequence model with linear complexity. Its bidirectional design captures dependencies across the video sequence in both directions.
- Rotary Major Scan (RMS): Rearranges 3D video token tensors using different scanning patterns (e.g., spatial row-major, spatial column-major, temporal row-major, temporal column-major) to strengthen short-range correlations and reduce computational latency.
- Review Tokens: Prepends an average-pooled version of the token sequence to the input before it is processed, giving the model a global overview and strengthening long-range dependencies.
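The token rearrangement and review-token ideas above can be illustrated with a small pure-Python sketch. It is not LinGen's implementation: the exact definition of each scan pattern, the per-layer rotation rule, the helper names (`scan_order`, `pattern_for_layer`, `add_review_tokens`), and the 1D pooling are all assumptions made for illustration. The point it shows is that tokens adjacent in the 3D grid can sit far apart in one flattened order and side by side in another, so rotating the order across layers lets each kind of neighbour become close at some layer.

```python
# Illustrative sketch only: the pattern names follow the text, but their exact
# definitions and the rotation rule below are assumptions.
PATTERNS = [
    "spatial_row_major",
    "spatial_col_major",
    "temporal_row_major",
    "temporal_col_major",
]


def scan_order(T, H, W, pattern):
    """Return (t, h, w) coordinates in the order the given pattern visits them."""
    if pattern == "spatial_row_major":   # frame by frame, w varies fastest
        return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    if pattern == "spatial_col_major":   # frame by frame, h varies fastest
        return [(t, h, w) for t in range(T) for w in range(W) for h in range(H)]
    if pattern == "temporal_row_major":  # t varies fastest, positions row by row
        return [(t, h, w) for h in range(H) for w in range(W) for t in range(T)]
    if pattern == "temporal_col_major":  # t varies fastest, positions column by column
        return [(t, h, w) for w in range(W) for h in range(H) for t in range(T)]
    raise ValueError(f"unknown pattern: {pattern}")


def pattern_for_layer(layer_idx):
    """Rotate through the patterns so successive layers see different orders."""
    return PATTERNS[layer_idx % len(PATTERNS)]


def sequence_distance(order, a, b):
    """How far apart two grid coordinates end up in the flattened sequence."""
    return abs(order.index(a) - order.index(b))


def add_review_tokens(tokens, pool):
    """Prepend average-pooled tokens; each token is a list of floats.

    Pooling over consecutive groups of `pool` tokens is a 1D simplification.
    """
    review = []
    for start in range(0, len(tokens), pool):
        group = tokens[start:start + pool]
        dim = len(group[0])
        review.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
    return review + tokens


if __name__ == "__main__":
    T, H, W = 2, 3, 4                      # hypothetical token grid
    up, down = (0, 0, 0), (0, 1, 0)        # vertical neighbours in one frame
    now, later = (0, 0, 0), (1, 0, 0)      # same position, adjacent frames
    for layer in range(4):
        name = pattern_for_layer(layer)
        order = scan_order(T, H, W, name)
        print(name, sequence_distance(order, up, down),
              sequence_distance(order, now, later))
    print(add_review_tokens([[1.0], [3.0], [5.0], [7.0]], pool=2))
```

In this toy grid, vertical neighbours are four positions apart under the spatial row-major order but adjacent under the spatial column-major order, and tokens at the same position in consecutive frames become adjacent only under the temporal orders.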
TE-branch (TEmporal Swin Attention Branch):
- Divides the 3D video token tensor into small windows for localized self-attention.
- TESA Module: Captures both spatially adjacent and temporally medium-range token relationships.
- Alternating window shifts between layers expand the receptive field and improve temporal and spatial consistency.
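The windowing and alternating shifts described above can be sketched as follows. This is a minimal illustration, not the TESA module itself: the grid and window sizes are hypothetical, the half-window cyclic shift on odd layers is borrowed from the Swin Transformer convention and is an assumption here, and the attention masking that a real shifted-window implementation applies to wrapped-around tokens is omitted.

```python
def window_of(coord, grid, window, shift=(0, 0, 0)):
    """Window index of a (t, h, w) token after a cyclic shift along each axis."""
    return tuple(((c + s) % g) // w for c, s, g, w in zip(coord, shift, grid, window))


def partition(grid, window, shift=(0, 0, 0)):
    """Group every token of the grid by the window it falls into."""
    T, H, W = grid
    groups = {}
    for t in range(T):
        for h in range(H):
            for w in range(W):
                key = window_of((t, h, w), grid, window, shift)
                groups.setdefault(key, []).append((t, h, w))
    return groups


def shift_for_layer(layer_idx, window):
    """Even layers: no shift. Odd layers: shift by half a window (assumed rule)."""
    if layer_idx % 2 == 0:
        return (0, 0, 0)
    return tuple(w // 2 for w in window)


if __name__ == "__main__":
    grid, window = (4, 4, 4), (2, 2, 2)   # hypothetical sizes
    a, b = (0, 0, 1), (0, 0, 2)           # neighbours along the width axis
    for layer in range(2):
        shift = shift_for_layer(layer, window)
        same = window_of(a, grid, window, shift) == window_of(b, grid, window, shift)
        print(f"layer {layer}: shift={shift}, a and b share a window: {same}")
```

Tokens attend only to others with the same window index. The two neighbouring tokens in the example fall into different windows on the unshifted layer and into the same window on the shifted one, which is how alternating shifts let information cross window boundaries over successive layers.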
Linear Complexity:
- Thanks to the MATE module design, LinGen's computational cost scales linearly with the number of pixels in the generated video, whereas the full self-attention used in conventional Diffusion Transformers scales quadratically. The gap widens as resolution and duration grow, which is what makes long, high-resolution generation affordable.
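A back-of-the-envelope comparison makes the scaling difference concrete. The sketch below only counts token-to-token interactions; it does not model LinGen's actual FLOPs, and the grid size, the window size, and the function names are hypothetical.

```python
def num_tokens(frames, height, width):
    """Number of tokens in a T x H x W latent grid."""
    return frames * height * width


def full_attention_pairs(n):
    """Full self-attention: every token interacts with every token."""
    return n * n


def windowed_attention_pairs(n, window_tokens):
    """Local attention: every token interacts with a fixed-size window."""
    return n * window_tokens


def growth(cost, n, factor=2, **kwargs):
    """Ratio by which a cost function grows when the token count grows by `factor`."""
    return cost(n * factor, **kwargs) / cost(n, **kwargs)


if __name__ == "__main__":
    n = num_tokens(frames=16, height=32, width=32)   # hypothetical latent grid
    print("tokens:", n)
    print("full attention, 2x tokens:", growth(full_attention_pairs, n))
    print("windowed attention, 2x tokens:",
          growth(windowed_attention_pairs, n, window_tokens=64))
```

Doubling the token count (for example, doubling the video length) quadruples the interaction count for full attention but only doubles it when each token attends to a fixed-size window. A linear-time sequence model behaves like the second case.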
Training Strategy:
- LinGen uses a progressive training strategy, first pretraining on low-resolution text-to-image tasks, then gradually increasing video resolution and duration.
- During the text-to-video pretraining stage, it trains jointly on text-image pairs and text-video data (hybrid training) to improve video consistency.
- Further fine-tuning on high-quality video datasets enhances the final output quality.
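The progressive strategy above can be written down as a stage schedule. The sketch below is a hypothetical configuration: the stage names, resolutions, and clip lengths are placeholders chosen for illustration and are not taken from the paper. Only the ordering (image pretraining, hybrid video pretraining at growing resolution and duration, then quality fine-tuning) follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    task: str          # "t2i", "t2v+t2i" (hybrid), or "t2v"
    resolution: int    # short side in pixels (placeholder value)
    seconds: float     # clip length; 0 for still images (placeholder value)


# Placeholder values for illustration only, not taken from the paper.
SCHEDULE = [
    Stage("image_pretrain", "t2i", 256, 0.0),
    Stage("video_pretrain_low", "t2v+t2i", 256, 4.0),
    Stage("video_pretrain_high", "t2v+t2i", 512, 16.0),
    Stage("video_pretrain_long", "t2v+t2i", 512, 60.0),
    Stage("quality_finetune", "t2v", 512, 60.0),
]


def is_progressive(schedule):
    """True if neither resolution nor duration ever decreases between stages."""
    pairs = zip(schedule, schedule[1:])
    return all(b.resolution >= a.resolution and b.seconds >= a.seconds
               for a, b in pairs)


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(f"{stage.name:22s} {stage.task:8s} {stage.resolution}p {stage.seconds}s")
    print("progressive:", is_progressive(SCHEDULE))
```

A check like `is_progressive` is a cheap guard against misordered stages when such a schedule is edited by hand.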

Project Links

Official website: https://lineargen.github.io/
GitHub repository: https://github.com/jha-lab/LinGen
arXiv paper: https://arxiv.org/pdf/2412.09856

Application Scenarios of LinGen

Content Creation: Quickly generates high-quality video content such as advertisements, films, and TV shows, significantly reducing production time and costs.
Entertainment Industry: Generates cutscenes and background videos for games, enhancing visual effects and immersion.
Education and Training: Creates educational videos like lecture explanations and experiment demonstrations, improving engagement and interactivity; also helps generate training materials for better learning outcomes.
Advertising Videos: Rapidly generates commercial videos for various scenarios, increasing advertising production efficiency and effectiveness.
Artistic Creation: Assists artists in generating artistic videos, offering new tools for creative expression and inspiration.