SlowFast-LLaVA-1.5 – A Multimodal Long-Video Understanding Model Released by Apple

What is SlowFast-LLaVA-1.5?

SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5) is a family of efficient video large language models designed for long-video understanding. It uses a two-stream (SlowFast) mechanism to trade off the number of input frames against the number of tokens per frame, so it can capture detailed spatial features while still handling long-range temporal motion efficiently. The models range from 1B to 7B parameters and are trained with a streamlined two-stage process on a mixture of high-quality public datasets. SF-LLaVA-1.5 performs strongly on long-video understanding tasks, maintains strong performance on image understanding tasks, and shows its clearest advantages at smaller model scales, which makes it well suited to lightweight and mobile-friendly video understanding applications.

Key Features of SlowFast-LLaVA-1.5

Efficient Long-Video Understanding: Efficiently processes complex spatiotemporal information in long videos and captures long-range context, making it suitable for understanding and analyzing long-form video content.
Multimodal Fusion: Handles both video and image inputs for comprehensive visual understanding, and is trained jointly on video and image tasks, which improves performance across multiple visual tasks.
Lightweight & Mobile-Friendly: Designed for lightweight deployment in resource-constrained environments such as mobile devices, supporting edge computing and real-time applications.
Powerful Reasoning Capabilities: Built on a large language model (LLM), it can generate detailed descriptions of video content and answer video-related questions.
Scalability: Offers models ranging from 1B to 7B parameters, allowing users to select the appropriate model size to balance performance and resource constraints.

Technical Principles of SlowFast-LLaVA-1.5

Dual-Stream Mechanism (SlowFast):
- Slow Stream: Processes videos at a lower frame rate to capture detailed static spatial features, focusing on keyframe information.
- Fast Stream: Processes videos at a higher frame rate but with fewer features per frame, focusing on motion information and dynamic changes.
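The trade-off between the two streams can be made concrete with a little token arithmetic. The sketch below is illustrative only: the frame counts, the tokens-per-frame values, and the helper names (`StreamConfig`, `uniform_indices`, `token_budget`) are assumptions chosen for this example and are not taken from the paper or the released code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StreamConfig:
    num_frames: int        # frames sampled for this stream
    tokens_per_frame: int  # visual tokens kept per frame after pooling


def uniform_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the video."""
    if num_samples <= 0 or total_frames <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the centre of each equal-width segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


def token_budget(slow: StreamConfig, fast: StreamConfig) -> int:
    """Total visual tokens passed to the LLM by both streams."""
    return (slow.num_frames * slow.tokens_per_frame
            + fast.num_frames * fast.tokens_per_frame)


# Assumed values for illustration (not the paper's settings):
# the slow stream sees few frames at high spatial detail,
# the fast stream sees many frames with heavily pooled features.
slow = StreamConfig(num_frames=8, tokens_per_frame=144)
fast = StreamConfig(num_frames=64, tokens_per_frame=16)

slow_idx = uniform_indices(total_frames=640, num_samples=slow.num_frames)
fast_idx = uniform_indices(total_frames=640, num_samples=fast.num_frames)

total = token_budget(slow, fast)
# Cost if every fast-stream frame kept the slow stream's token count.
single_stream = fast.num_frames * slow.tokens_per_frame

print(f"two-stream tokens: {total}, single-stream tokens: {single_stream}")
```

With these assumed numbers, the two streams together cost 2,176 visual tokens, versus 9,216 if all 64 frames kept 144 tokens each.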
Two-Stage Training Process:
- Stage 1 (Image Understanding): Supervised fine-tuning (SFT) on image data provides general knowledge and reasoning capabilities, ensuring solid performance on image tasks.
- Stage 2 (Joint Video & Image Training): Combines video and image data to further improve video understanding performance while retaining strong image task capabilities.
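The two stages differ mainly in which modalities feed the training loop. A minimal sketch of that schedule is shown here, assuming hypothetical stage names, a toy sample list, and a helper called `samples_for_stage`; none of these come from the released training code.

```python
# Hypothetical stage definitions; names and fields are illustrative only.
STAGES = [
    {"name": "stage1_image_sft", "modalities": {"image"}},
    {"name": "stage2_joint_video_image", "modalities": {"image", "video"}},
]

# Toy entries standing in for the real image and video corpora.
SAMPLES = [
    {"id": "img-0", "modality": "image"},
    {"id": "img-1", "modality": "image"},
    {"id": "vid-0", "modality": "video"},
]


def samples_for_stage(stage, samples):
    """Keep only the samples whose modality is enabled for this stage."""
    return [s for s in samples if s["modality"] in stage["modalities"]]


# Map each stage name to the sample ids it would train on.
schedule = {
    stage["name"]: [s["id"] for s in samples_for_stage(stage, SAMPLES)]
    for stage in STAGES
}

for name, ids in schedule.items():
    print(name, ids)
```

Stage 1 sees only image samples, while Stage 2 mixes image and video samples so that image capability is retained while video understanding is learned.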
High-Quality Data Mixture:
- Image Data: Includes general, text-rich, and knowledge-based datasets such as LLaVA Complex Reasoning, ShareGPT4V, and COCO Caption.
- Video Data: Covers large-scale video datasets and long-video understanding tasks such as LLaVA-Hound, ShareGPT4Video, ActivityNet-QA, ensuring strong performance across video tasks.
Model Architecture: Uses Oryx-ViT as the visual encoder and Qwen2.5 series as the language model (LLM), with separate projectors for video and image inputs to accommodate modality-specific characteristics.
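The idea of modality-specific projectors can be sketched as a simple routing step between the visual encoder and the LLM. The example below is a toy stand-in under stated assumptions: the dimensions, the single linear map used as a projector, and the names `make_linear` and `project_visual_tokens` are hypothetical and do not reflect the actual projector design.

```python
import random

VISION_DIM, LLM_DIM = 4, 6  # toy sizes, assumed for illustration


def make_linear(in_dim, out_dim, seed):
    """Build a toy linear projector with fixed random weights."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]

    def project(tokens):
        # Map each visual token (length in_dim) to a vector of length out_dim.
        return [[sum(w * x for w, x in zip(row, tok)) for row in weight]
                for tok in tokens]

    return project


# One projector per modality, each with its own weights.
projectors = {
    "image": make_linear(VISION_DIM, LLM_DIM, seed=0),
    "video": make_linear(VISION_DIM, LLM_DIM, seed=1),
}


def project_visual_tokens(tokens, modality):
    """Route encoder outputs through the projector for their modality."""
    if modality not in projectors:
        raise ValueError(f"unsupported modality: {modality}")
    return projectors[modality](tokens)


# Two toy visual tokens, as if produced by the visual encoder.
tokens = [[1.0, 0.5, -0.5, 2.0], [0.0, 1.0, 1.0, 0.0]]
image_out = project_visual_tokens(tokens, "image")
video_out = project_visual_tokens(tokens, "video")
```

Both calls return one vector of the LLM's embedding size per input token, but because the weights are separate, the same encoder output is mapped differently depending on whether it came from an image or a video.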

Project Resources

GitHub Repository: https://github.com/apple/ml-slowfast-llava
arXiv Technical Paper: https://arxiv.org/html/2503.18943v1

Application Scenarios of SlowFast-LLaVA-1.5

Long-Video Content Understanding & Summarization: Automatically generates summaries of long videos, helping users quickly grasp key content and save time.
Video Q&A Systems: Users ask questions in natural language, and the model generates accurate answers based on the long-video content, improving the interactive experience.
Video Editing & Creation: Automatically clips key segments from long videos to produce short videos, improving content creation efficiency.
Video Surveillance & Analysis: Identifies abnormal behaviors in surveillance videos in real time, such as crowd gatherings, enhancing intelligent monitoring.
Multimedia Content Recommendation: Recommends relevant long-video content based on users’ viewing history, increasing engagement and retention.
