LONGLIVE – An Interactive Long-Video Generation Framework Introduced by NVIDIA and Others


What is LONGLIVE?

LONGLIVE is a real-time interactive long-video generation framework jointly introduced by NVIDIA and other leading institutions. By combining a frame-level autoregressive (AR) model with techniques such as KV-recache, streaming long-video fine-tuning, and short-window attention + frame sink, LONGLIVE addresses the dual challenges of efficiency and quality in long-video generation.

The framework generates high-quality videos of up to 240 seconds at 20.7 FPS on a single NVIDIA H100 GPU, supporting real-time prompt switching and dynamic adjustment. This opens up new creative possibilities in fields such as entertainment, education, and film production, marking a key step in transforming AI video generation from a “toy” into a productivity tool.


Key Features of LONGLIVE

  • Real-time interaction: Supports live prompt streaming during video generation, enabling users to dynamically adjust content, guide narratives, or change styles on the fly.

  • Long video generation: Produces videos lasting several minutes with coherent storytelling and scene development.

  • Efficient inference: Achieves real-time performance of 20.7 FPS on a single NVIDIA H100 GPU, generating up to 240 seconds of video while maintaining fidelity and temporal consistency.

  • High-quality output: Ensures visual coherence and semantic consistency, with smooth transitions even when prompts are frequently switched.

  • Low deployment cost: Supports INT8 quantized inference, shrinking the model and cutting deployment cost with minimal quality loss (a generic example follows this list).
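
The post does not detail LONGLIVE’s actual quantization pipeline, so the snippet below is only a generic sketch of post-training dynamic INT8 quantization in PyTorch, applied to a toy model; the layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy stand-in model: dynamic quantization stores Linear weights as int8 and
# quantizes activations on the fly, shrinking the model at a small quality cost.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```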


Technical Principles of LONGLIVE

  • KV-recache mechanism: When the prompt is switched, the model recomputes its key-value (KV) cache to refresh its internal state, flushing residual traces of the old prompt while retaining visual and motion cues from the frames already generated. This yields smooth transitions and faithful execution of the new instruction. Recache operations are also performed during training, so the model learns to handle prompt switches seamlessly (see the first sketch after this list).

  • Streaming Long Tuning: Addresses the quality degradation AR models suffer on long videos. Training simulates inference through a “rolling extension” of the sequence, shrinking the mismatch between training and inference. Supervising each window locally and detaching gradients between windows avoids the out-of-memory (OOM) failures of long-sequence backpropagation while preserving reliable teacher-model guidance (see the second sketch after this list).

  • Short-window attention + frame sink: Restricts attention to a local window, sharply reducing computational complexity and memory usage. The frame sink keeps global anchors (e.g., the first frame block) attendable by every later frame, preserving long-range consistency while retaining short-window efficiency (see the third sketch after this list).
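
Since LONGLIVE’s actual cache layout and conditioning path are not described in this post, the first toy sketch (PyTorch) only illustrates the KV-recache idea: keys and values depend on the prompt, so on a prompt switch the already generated frames are re-encoded under the new prompt instead of reusing the stale cache. The `encode_frames` helper, tensor shapes, and additive prompt conditioning are all assumptions.

```python
import torch
import torch.nn.functional as F

def encode_frames(frames, prompt_emb, w_k, w_v):
    # Toy "encoder": keys/values for cached frames are conditioned on the
    # prompt embedding, which is why they go stale when the prompt changes.
    conditioned = frames + prompt_emb            # stand-in for real conditioning
    return conditioned @ w_k, conditioned @ w_v  # (T, d) keys and values

def kv_recache(frames, new_prompt_emb, w_k, w_v):
    # KV-recache: rebuild the cache by re-encoding the frames generated so
    # far under the NEW prompt. Old-prompt traces are flushed; visual and
    # motion content survives because the same frame latents are reused.
    return encode_frames(frames, new_prompt_emb, w_k, w_v)

torch.manual_seed(0)
d = 16
w_k, w_v = torch.randn(d, d), torch.randn(d, d)
frames = torch.randn(8, d)                       # latents generated so far
k, v = kv_recache(frames, torch.randn(d), w_k, w_v)

# Decoding continues against the refreshed cache (shapes are toy-sized).
query = torch.randn(1, 1, d)
out = F.scaled_dot_product_attention(query, k.unsqueeze(0), v.unsqueeze(0))
```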
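The streaming long tuning bullet describes a training recipe, so the second sketch shows a minimal runnable version of its core loop under stated assumptions: a toy recurrent “student,” a frozen teacher stand-in, per-window (local) supervision, and state detachment between windows. Module names, shapes, and the MSE objective are illustrative, not LONGLIVE’s actual training code.

```python
import torch
import torch.nn as nn

class ToyStudent(nn.Module):
    # Stand-in for the AR video model: emits one "window" of frame latents
    # per step from a recurrent state. Purely illustrative.
    def __init__(self, d=32):
        super().__init__()
        self.cell = nn.GRUCell(d, d)
        self.head = nn.Linear(d, d)

    def rollout_window(self, prompt_emb, state):
        state = self.cell(prompt_emb, state)
        return self.head(state), state

def streaming_long_tune(student, teacher_fn, prompt_emb, num_windows, opt):
    # Rolling extension: roll out window by window, exactly as at inference.
    state = torch.zeros(1, prompt_emb.shape[-1])
    for _ in range(num_windows):
        frames, state = student.rollout_window(prompt_emb, state)
        with torch.no_grad():
            target = teacher_fn(frames)                # frozen teacher guidance
        loss = nn.functional.mse_loss(frames, target)  # local supervision only
        opt.zero_grad(); loss.backward(); opt.step()
        state = state.detach()  # cut the graph: backprop never spans windows

student = ToyStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
streaming_long_tune(student, lambda f: 0.9 * f, torch.randn(1, 32),
                    num_windows=4, opt=opt)
```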
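Short-window attention with a frame sink reduces to an attention mask, which is easy to show concretely. The third sketch builds a causal mask that lets every query attend to its recent window plus the first few “sink” positions; the token counts and window size are made up for illustration.

```python
import torch

def sink_window_mask(num_tokens, window, sink_tokens):
    # True = may attend. Combines causality, a short local window, and a
    # "frame sink": the first sink_tokens positions (e.g., the first frame
    # block) stay visible to every later query as a global anchor.
    q = torch.arange(num_tokens).unsqueeze(1)  # query positions (column)
    k = torch.arange(num_tokens).unsqueeze(0)  # key positions (row)
    causal = k <= q
    local = (q - k) < window                   # only the last `window` tokens
    sink = k < sink_tokens                     # always-visible anchor tokens
    return causal & (local | sink)

# Example: 16 tokens, 4-token window, 2 sink tokens. The boolean mask can be
# passed to F.scaled_dot_product_attention via attn_mask (True = attend).
mask = sink_window_mask(16, window=4, sink_tokens=2)
print(mask.int())
```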



Application Scenarios of LONGLIVE

  • Creative video production: Creators can adjust video content and style in real time, rapidly generating long videos tailored to creative needs, improving efficiency and flexibility.

  • Educational content generation: Teachers can generate instructional videos dynamically, inserting key concepts or case studies on the fly to enhance engagement and interactivity.

  • Film production: Directors and screenwriters can preview different scenes and narrative paths in real time before shooting, optimizing scripts and production plans while reducing costs.

  • Advertising: Ad teams can instantly generate campaign videos, adapting content to client needs and improving targeting and impact.

  • Game development: Developers can generate cutscenes or dynamic backgrounds in real time, adjusting content based on game progress to enhance player immersion.
