Video-T1 – Video Generation Technology Jointly Launched by Tsinghua University and Tencent
What is Video-T1?
Video-T1 is a video generation technology jointly developed by researchers from Tsinghua University and Tencent. It leverages Test-Time Scaling (TTS) to improve the quality and consistency of generated videos. Unlike conventional video generation models, which produce a video in a single pass once training is complete, Video-T1 spends additional computation at inference time, dynamically adjusting the generation path to optimize video quality. The research introduces the Tree-of-Frames (ToF) method, which divides video generation into multiple stages to progressively improve frame coherence and alignment with the text prompt. Video-T1 offers a novel optimization approach for video generation and showcases the potential of test-time scaling.
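To make the test-time scaling idea concrete, here is a minimal, illustrative Python sketch of its simplest form: spend extra inference compute by sampling several candidate videos and let a verifier pick the best one. The names `generate_video`, `verifier_score`, and `best_of_n_sampling` are placeholders assumed for this sketch, not part of the Video-T1 codebase.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

# Hypothetical interfaces for illustration only: `generate_video` stands in
# for a pretrained text-to-video sampler and `verifier_score` for a
# test-time verifier (e.g. a vision-language reward model). Neither name
# comes from the Video-T1 codebase.

Frame = Any  # placeholder for a decoded video frame


@dataclass
class Candidate:
    frames: List[Frame]
    score: float


def best_of_n_sampling(
    prompt: str,
    generate_video: Callable[[str, int], List[Frame]],
    verifier_score: Callable[[str, List[Frame]], float],
    num_candidates: int = 8,
) -> Candidate:
    """Spend extra inference compute: sample several videos from different
    noise seeds, score each with the verifier, and keep the best one."""
    best: Optional[Candidate] = None
    for _ in range(num_candidates):
        seed = random.randint(0, 2**31 - 1)     # fresh noise for every sample
        frames = generate_video(prompt, seed)   # one full denoising run
        score = verifier_score(prompt, frames)  # text-video alignment / quality
        if best is None or score > best.score:
            best = Candidate(frames=frames, score=score)
    return best
```

More candidates mean more test-time compute and, up to a point, better verifier-selected output; this trade-off is the core of test-time scaling.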

The main functions of Video-T1
- Enhance video quality: Spend additional compute at inference time to generate higher-quality videos with less blur and noise.
- Improve text consistency: Keep the generated video faithful to the given text prompt, strengthening the match between video and text.
- Optimize video coherence: Improve the smoothness of motion and temporal consistency between video frames, reducing flickering and jittering.
- Adapt to complex scenarios: Generate more stable and realistic video content when dealing with complex scenes and dynamic objects.
The Technical Principle of Video-T1
- Search Space Construction: Build the search space using feedback from test-time verifiers, combined with heuristic algorithms that guide the search process.
- Random Linear Search: Sample noise candidates at inference time, gradually denoise each into a video clip, and keep the result with the highest verifier score.
- Tree-of-Frames (ToF) Method:
◦ Image-level Alignment: Generate and verify the initial frame first, since its quality and prompt alignment influence all subsequent frames.
◦ Dynamic Prompt Application: Apply dynamically adjusted prompts in the test-time verifier, focusing on motion stability and physical plausibility.
◦ Overall Quality Assessment: Evaluate the overall quality of the completed video and select the one that best matches the text prompt.
- Autoregressive Expansion and Pruning: Dynamically expand and prune video branches in an autoregressive manner to improve generation efficiency (see the sketch after this list).
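The sketch below illustrates the branch-and-prune idea behind Tree-of-Frames as a simple beam-style search; it is not the actual Video-T1 implementation. It assumes two user-supplied callables, `extend_clip(prompt, clip, n_branches)`, which returns candidate continuations of a partial clip, and `score_clip(prompt, clip)`, which returns a verifier score for a partial clip; both names and signatures are assumptions for illustration.

```python
import heapq
from typing import Any, Callable, List, Sequence, Tuple

# Illustration only: `extend_clip` and `score_clip` are assumed callables,
# not the Video-T1 API. Partial clips are grown frame by frame, scored by
# a verifier, and low-scoring branches are pruned at every step.

Frame = Any
Clip = List[Frame]


def tree_of_frames_search(
    prompt: str,
    extend_clip: Callable[[str, Clip, int], Sequence[Clip]],
    score_clip: Callable[[str, Clip], float],
    num_steps: int,
    branches_per_node: int = 3,
    beam_width: int = 4,
) -> Clip:
    """Grow several candidate clips autoregressively, score the partial
    clips with the verifier after each step, and prune all but the top
    `beam_width` branches before the next expansion."""
    beam: List[Tuple[float, Clip]] = [(0.0, [])]  # (verifier score, partial clip)
    for _ in range(num_steps):
        candidates: List[Tuple[float, Clip]] = []
        for _, clip in beam:
            for new_clip in extend_clip(prompt, clip, branches_per_node):
                candidates.append((score_clip(prompt, new_clip), new_clip))
        # Pruning step: keep only the highest-scoring branches.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    # Return the completed clip the verifier ranks highest overall.
    return max(beam, key=lambda c: c[0])[1]
```

Pruning weak branches early concentrates the extra test-time compute on the most promising generation paths instead of spreading it uniformly over all candidates.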
Project links for Video-T1
- Project official website: https://liuff19.github.io/Video-T1/
- GitHub repository: https://github.com/liuff19/Video-T1
- arXiv technical paper: https://arxiv.org/pdf/2503.18942
Application scenarios of Video-T1
- Creative Video Production: Quickly generate high-quality video materials that meet creative requirements for content creators and the advertising industry, enhancing content appeal.
- Film and Television Production: Assist in special effects and animation production, generating complex scenes and character actions to improve the efficiency of film and television production.
- Education and Training: Generate teaching videos and training simulation scenarios, making teaching and training more engaging and intuitive.
- Game Development: Generate in-game cutscenes and virtual character actions to enhance the immersion and interactivity of games.
- VR and AR: Generate high-quality VR content and AR dynamic effects to enhance user experience and immersion.