StreamBridge – A framework for edge-side video large language models jointly launched by Apple and Fudan University


What is StreamBridge

StreamBridge is a framework for video large language models (Video-LLMs) jointly developed by Apple and Fudan University that adapts offline Video-LLMs for real-time understanding of streaming video. The framework uses a memory buffer together with a round-based decay compression strategy to support long-context interactions, and introduces a lightweight activation model to enable proactive responses.

The research team also released a dataset named Stream-IT, containing around 600,000 samples, to enhance streaming video comprehension. Tests on mainstream offline models such as LLaVA-OV-7B, Qwen2-VL-7B, and Oryx-1.5-7B show that StreamBridge significantly improves the models' multi-turn real-time understanding and proactive response capabilities, demonstrating strong potential in the field of streaming video comprehension.



Key Features of StreamBridge

  • Multi-turn Real-time Understanding: Supports long-context multi-round interaction, preserving historical visual and dialogue context while processing the most recent video segments.

  • Proactive Response: The model can proactively monitor video streams and output timely feedback without explicit prompts, similar to human behavior.

  • Flexible Integration: Seamlessly integrates into existing video large language models without requiring large-scale modifications to the base models.

  • Data Support: Provides a large-scale streaming video understanding dataset, Stream-IT, with around 600,000 samples. The dataset supports diverse instruction formats for model training and optimization.


Technical Principles of StreamBridge

  • Memory Buffer: Stores and retrieves embedding information of video frames to support multi-turn interaction. Each new video frame is independently encoded and appended to the buffer. Upon receiving a user query, the buffer content is flattened into a single input embedding sequence, which is then fed into the language model for response generation.
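The buffer mechanism above can be sketched in a few lines. This is a minimal illustrative mock, not StreamBridge's actual implementation: embeddings are plain Python lists, and the class/method names are assumptions.

```python
class MemoryBuffer:
    """Toy sketch of a streaming memory buffer: per-frame token embeddings
    are appended as frames arrive, then flattened on a user query."""

    def __init__(self):
        self.frames = []  # each entry: a list of visual token embeddings for one frame

    def append(self, frame_tokens):
        # each incoming frame is encoded independently and appended to the buffer
        self.frames.append(frame_tokens)

    def flatten(self, query_tokens):
        # concatenate all stored visual tokens with the query tokens into a
        # single input embedding sequence for the language model
        seq = []
        for frame in self.frames:
            seq.extend(frame)
        seq.extend(query_tokens)
        return seq

buf = MemoryBuffer()
buf.append([[0.1, 0.2], [0.3, 0.4]])   # frame 1: two visual tokens
buf.append([[0.5, 0.6]])               # frame 2: one visual token
seq = buf.flatten([[9.0, 9.0]])        # user query arrives: flatten everything
```

Because each frame is encoded independently, appending is cheap and the expensive flattening cost is only paid when a query actually arrives.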

  • Round-based Decay Compression Strategy: Before generating each response, if the total input embedding length exceeds a predefined maximum, the model starts from the earliest interaction rounds and merges visual tokens frame by frame until the length is below the threshold. The merging operation uses average pooling to ensure recent visual context is preserved.
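The decay strategy can be illustrated with a small sketch. Assuming one-token merges are done by average pooling over adjacent token pairs (the exact merge granularity in the paper may differ, so treat this as a hypothetical rendering):

```python
def avg_pool_pairs(tokens):
    # merge adjacent visual tokens via average pooling, roughly halving the count
    merged = []
    for i in range(0, len(tokens) - 1, 2):
        a, b = tokens[i], tokens[i + 1]
        merged.append([(x + y) / 2 for x, y in zip(a, b)])
    if len(tokens) % 2:
        merged.append(tokens[-1])  # odd leftover token is kept as-is
    return merged

def decay_compress(rounds, max_len):
    """Compress oldest rounds first until the total token count fits.
    `rounds` is a list of visual-token lists, oldest interaction first."""
    total = sum(len(r) for r in rounds)
    i = 0
    while total > max_len and i < len(rounds):
        before = len(rounds[i])
        rounds[i] = avg_pool_pairs(rounds[i])
        total -= before - len(rounds[i])
        if len(rounds[i]) <= 1:
            i += 1  # oldest round fully collapsed; move to the next round
    return rounds

# two rounds of four single-dim tokens each; budget of 5 tokens
rounds = [[[1.0], [2.0], [3.0], [4.0]], [[5.0], [6.0], [7.0], [8.0]]]
compressed = decay_compress(rounds, max_len=5)
```

Because compression always starts from the earliest rounds, the most recent visual context stays at full resolution, which is the stated goal of the strategy.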

  • Lightweight Activation Model: A standalone lightweight multimodal large language model (MLLM) runs in parallel with the main video LLM. The activation model takes the current frame (along with the user query and optionally several previous frames) as input and outputs a binary signal indicating whether the main model should generate a response. A learnable activation token <ACT> is used during training to supervise activation timing via a classification head.
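The gating loop between the activation model and the main model can be sketched as below. The models are stand-in callables here (the real activation model is an MLLM scoring the `<ACT>` token through a classification head), and the 0.5 threshold is an assumed default:

```python
def should_respond(activation_model, frame, query, history=None):
    # The lightweight model emits a probability that the main model should
    # respond now; here it is mocked as a plain callable.
    p = activation_model(frame, query, history or [])
    return p > 0.5  # assumed decision threshold

def stream_loop(frames, query, activation_model, main_model):
    """Watch the stream; only invoke the main video LLM when the
    activation model fires, instead of on every frame."""
    responses = []
    for frame in frames:
        if should_respond(activation_model, frame, query):
            responses.append(main_model(frame, query))
    return responses

# toy stand-ins: activation fires only on an "interesting" frame
toy_activation = lambda frame, query, history: 0.9 if "goal" in frame else 0.1
toy_main = lambda frame, query: f"Noticed: {frame}"
out = stream_loop(["warmup", "kickoff", "goal scored"],
                  "tell me when something happens",
                  toy_activation, toy_main)
```

Running the expensive main model only when the cheap activation model fires is what makes proactive monitoring affordable on edge devices.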

  • Stream-IT Dataset: Constructed by selecting semantically related video segments from large-scale subtitle corpora and generating multi-turn Q&A sequences to simulate real-time user interaction. The dataset includes around 600,000 samples and supports a variety of task formats, such as dense video captioning, sequential step recognition, and video-based question answering.
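A Stream-IT-style training sample might look like the structure below. The field names and values are purely illustrative assumptions, not the released schema; the point is the interleaving of segments with multi-turn Q&A to simulate a real-time user.

```python
# Hypothetical sketch of one interleaved training sample
sample = {
    "task": "dense_video_captioning",          # one of several task formats
    "segments": ["clip_000.mp4", "clip_001.mp4"],  # semantically related clips
    "turns": [
        {"after_segment": 0,
         "question": "What is happening so far?",
         "answer": "A person sets up a tripod."},
        {"after_segment": 1,
         "question": "What happened next?",
         "answer": "They mount a camera and start recording."},
    ],
}
```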


Project Link


Application Scenarios of StreamBridge

  • Real-time Video Interaction: Enhances real-time interaction in scenarios like video conferencing and online education.

  • Autonomous Driving Assistance: Processes road condition videos in real time to support driving decisions.

  • Smart Surveillance: Analyzes surveillance video in real time to quickly detect abnormal behavior.

  • Robotic Vision: Helps robots understand their surroundings in real time and interact naturally.

  • Content Creation: Assists video creators with real-time content analysis and editing support.
