Open-o3 Video – A video reasoning model open-sourced jointly by Peking University and ByteDance
What is Open-o3 Video?
Open-o3 Video is an open-source video reasoning model jointly developed by Peking University and ByteDance. It achieves precise video reasoning by grounding its answers in explicit spatiotemporal evidence, such as key timestamps and bounding boxes. Supported by the carefully curated STGR dataset and a two-phase SFT-RL training strategy, the model delivers state-of-the-art performance on the V-STAR benchmark. Its non-agent architecture handles complex spatiotemporal relationships efficiently, resulting in strong performance on video reasoning tasks. The training pipeline comprises two phases, cold-start initialization followed by reinforcement learning, enabling the model to adapt effectively to diverse video-reasoning scenarios.

Key Features of Open-o3 Video
1. Spatiotemporal Reasoning
Integrates explicit spatiotemporal evidence, including key timestamps and bounding boxes, enabling accurate video reasoning and robust handling of temporal and spatial relationships within videos.
2. Data Curation & Training Strategy
Uses the curated STGR dataset and a two-phase SFT-RL training strategy:
- Cold-start initialization with supervised learning,
- Followed by reinforcement learning to optimize model performance.
This approach helps Open-o3 Video achieve strong results on the V-STAR benchmark.
3. Non-Agent Architecture
The non-agent design processes complex spatiotemporal relationships within a single model rather than a multi-step agent pipeline, improving both accuracy and computational efficiency in video reasoning tasks.
4. Open Source & Extensibility
Being fully open-source, the model is easy for researchers and developers to use, modify, and extend—helping accelerate progress in video reasoning research.
Technical Principles of Open-o3 Video
1. Explicit Spatiotemporal Evidence Integration
By explicitly incorporating key timestamps and bounding boxes, the model tightly couples reasoning with actual visual observations, making its predictions more interpretable and reliable.
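To make this concrete, the sketch below shows what an evidence-grounded prediction could look like. The schema (field names such as evidence, timestamp, and box) is hypothetical and chosen purely for illustration; the model's actual output format may differ.

```python
# A hypothetical evidence-grounded prediction. The exact output schema of
# Open-o3 Video may differ; this only illustrates the idea of coupling an
# answer with key timestamps and bounding boxes.
prediction = {
    "question": "Who picks up the red cup?",
    "answer": "The man in the blue jacket.",
    "evidence": [
        {
            "timestamp": 12.4,                # seconds into the video
            "box": [0.31, 0.22, 0.58, 0.79],  # normalized (x1, y1, x2, y2)
            "label": "man in blue jacket reaching for the cup",
        }
    ],
}

# Downstream code can verify the claim by seeking to `timestamp`
# and cropping the frame to `box`.
```

Because each claim carries a timestamp and a box, a prediction can be checked against the video itself rather than taken on faith.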
2. Two-Phase Training Strategy
The model combines cold-start initialization and reinforcement learning:
- The cold-start stage uses supervised learning to build foundational spatiotemporal reasoning ability.
- The reinforcement learning stage introduces multiple reward mechanisms to further improve accuracy, temporal alignment, and spatial precision, as sketched below.
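As a rough illustration of how such rewards can be combined, here is a minimal sketch of a composite reward that scores answer correctness, temporal alignment, and spatial overlap. The weights, the Gaussian temporal term, and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import math

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def composite_reward(pred, gt, sigma=2.0, w_ans=1.0, w_t=0.5, w_s=0.5):
    """Score one sampled response against the ground truth.

    `pred` and `gt` each carry an 'answer' string, a 'timestamp' in
    seconds, and a 'box' (x1, y1, x2, y2). The weights and the Gaussian
    width `sigma` are illustrative placeholders, not the paper's values.
    """
    r_answer = 1.0 if pred["answer"] == gt["answer"] else 0.0
    # Temporal term: decays smoothly as the predicted timestamp drifts
    # away from the annotated key timestamp.
    r_time = math.exp(-((pred["timestamp"] - gt["timestamp"]) ** 2) / (2 * sigma ** 2))
    # Spatial term: overlap between predicted and annotated boxes.
    r_space = iou(pred["box"], gt["box"])
    return w_ans * r_answer + w_t * r_time + w_s * r_space
```

During reinforcement learning, a scalar reward of this kind would be computed for each sampled model response and used to update the policy.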
3. Dataset Curation
Two high-quality datasets—STGR-CoT-30k and STGR-RL-36k—provide rich spatiotemporal annotations and reasoning traces. These datasets address the lack of unified spatiotemporal supervision in existing video-reasoning resources.
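As a sketch of what such unified supervision might contain, the record below pairs a question with a reasoning trace and spatiotemporal annotations. All field names and values are hypothetical; consult the released datasets for the actual schema.

```python
# Hypothetical shape of a single STGR-style training record. Field names
# and values are illustrative; the released STGR-CoT-30k / STGR-RL-36k
# datasets may use a different schema.
record = {
    "video": "clips/kitchen_0042.mp4",
    "question": "When does the chef add salt, and where is the shaker?",
    "reasoning_trace": "The chef reaches toward the right shelf at ~15s, "
                       "grasps the shaker, and tilts it over the pot.",
    "answer": "At about 15 seconds; the shaker is on the right shelf.",
    "temporal_span": [14.8, 16.2],                 # start/end in seconds
    "boxes": {"15.0": [0.62, 0.35, 0.71, 0.48]},   # timestamp -> bbox
}
```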
4. Non-Agent Framework Design
The non-agent architecture efficiently handles complex spatiotemporal relationships, avoiding information loss and inefficiency that may occur in agent-based models, ultimately improving overall reasoning performance.
Project Links
- Project Website: https://marinero4972.github.io/projects/Open-o3-Video/
- GitHub Repository: https://github.com/marinero4972/Open-o3-Video
- HuggingFace Model: https://huggingface.co/marinero4972/Open-o3-Video/tree/main
- arXiv Paper: https://arxiv.org/pdf/2510.20579
Application Scenarios of Open-o3 Video
1. Video Content Understanding
Accurately identifies and analyzes key events and objects in videos. By leveraging explicit spatiotemporal evidence, it provides detailed reasoning and explanations, helping users understand core video content.
2. Video Question Answering Systems
Serves as the core engine for video QA. It can quickly localize relevant spatiotemporal segments based on user queries and generate accurate, interpretable answers.
3. Video Editing & Creation
Assists creators by identifying key elements and highlight moments within videos—making tasks like editing, clipping, and applying effects more efficient.
4. Intelligent Surveillance & Analysis
Supports real-time analysis of surveillance footage, rapidly detecting abnormal events and important objects with detailed spatiotemporal evidence, helping upgrade intelligent security systems.
5. Education & Training
Can be used to analyze teaching videos, helping teachers and students better understand content and providing more targeted feedback for learning.
6. Entertainment & Interactive Media
Enhances interactivity for short-video platforms, livestreaming, and similar services—for example, generating creative reasoning-based Q&A or challenges to boost user engagement.