SlowFast-LLaVA-1.5 – A Multimodal Long-Video Understanding Model Released by Apple

AI Tools updated 11h ago dongdong

What is SlowFast-LLaVA-1.5?

SlowFast-LLaVA-1.5 (abbreviated SF-LLaVA-1.5) is a family of token-efficient video large language models designed for long-video understanding. Its two-stream (SlowFast) input design trades off the number of input frames against the number of tokens per frame: one stream preserves detailed spatial features, while the other covers long-range motion at a low token cost. The models range from 1B to 7B parameters and are trained with a simplified two-stage recipe on a mixture of high-quality public datasets. According to Apple, they perform strongly on long-video understanding tasks while remaining competitive on image understanding, and the advantage is most pronounced at the smaller scales, which makes them a good fit for lightweight, mobile-friendly video understanding applications.
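The trade-off between frame count and tokens per frame can be made concrete with a little arithmetic. The sketch below compares the visual token count of uniform sampling against a two-stream layout. All numbers (64 frames, 196 tokens per frame, and so on) and both function names are illustrative assumptions, not the released model's settings.

```python
def uniform_token_count(frames: int, tokens_per_frame: int) -> int:
    """Visual tokens when every sampled frame keeps the same number of tokens."""
    return frames * tokens_per_frame


def slowfast_token_count(slow_frames: int, slow_tokens_per_frame: int,
                         fast_frames: int, fast_tokens_per_frame: int) -> int:
    """Visual tokens when a slow stream and a fast stream are both fed to the LLM."""
    return (slow_frames * slow_tokens_per_frame
            + fast_frames * fast_tokens_per_frame)


# Hypothetical settings chosen only to illustrate the trade-off.
uniform = uniform_token_count(frames=64, tokens_per_frame=196)
slowfast = slowfast_token_count(
    slow_frames=8, slow_tokens_per_frame=196,   # few frames, full spatial detail
    fast_frames=64, fast_tokens_per_frame=16,   # many frames, heavily pooled
)

print(f"uniform sampling: {uniform} tokens")
print(f"two-stream layout: {slowfast} tokens")
```

Under these assumed settings, the two-stream layout still sees all 64 frames but uses roughly a fifth of the tokens, which is what leaves room in the context window for longer videos.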


Key Features of SlowFast-LLaVA-1.5

  • Efficient Long-Video Understanding: Processes the complex spatiotemporal information in long videos efficiently and captures long-range context, making it suitable for analyzing long-form video content.

  • Multimodal Fusion: Handles both video and image inputs and is trained jointly on video and image tasks, which improves performance across a range of visual tasks.

  • Lightweight & Mobile-Friendly: Designed for lightweight deployment in resource-constrained environments such as mobile devices, with edge computing and real-time applications in mind.

  • Powerful Reasoning Capabilities: Built on a large language model (LLM) backbone, it can generate detailed descriptions of video content and answer questions about a video.

  • Scalability: Offers models ranging from 1B to 7B parameters, allowing users to select the appropriate model size to balance performance and resource constraints.


Technical Principles of SlowFast-LLaVA-1.5

  • Dual-Stream Mechanism (SlowFast):

    • Slow Stream: Samples the video at a lower frame rate and keeps more features per frame, capturing detailed static spatial features from key frames.

    • Fast Stream: Samples the video at a higher frame rate but keeps fewer features per frame, focusing on motion and changes over time.

  • Two-Stage Training Process:

    • Stage 1 (Image Understanding): Supervised fine-tuning (SFT) on image data gives the model general knowledge and reasoning capabilities and establishes solid performance on image tasks.

    • Stage 2 (Joint Video & Image Training): Trains on video and image data together to improve video understanding while retaining strong image-task performance.

  • High-Quality Data Mixture:

    • Image Data: Includes general, text-rich, and knowledge-based datasets such as LLaVA Complex Reasoning, ShareGPT4V, and COCO Caption.

    • Video Data: Covers large-scale video datasets and long-video understanding tasks, such as LLaVA-Hound, ShareGPT4Video, and ActivityNet-QA, to support a broad range of video tasks.

  • Model Architecture: Uses Oryx-ViT as the visual encoder and the Qwen2.5 series as the language model (LLM), with separate projectors for video and image inputs to account for the differences between the two modalities.
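The dual-stream mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than Apple's implementation: the use of average pooling, the concatenation order, the helper names `spatial_avg_pool` and `slowfast_tokens`, and all shapes and strides are assumptions chosen for the example. The input stands in for per-frame patch features that a vision encoder would produce.

```python
import numpy as np


def spatial_avg_pool(frames: np.ndarray, out_hw: int) -> np.ndarray:
    """Average-pool a (T, H, W, C) feature grid down to (T, out_hw, out_hw, C)."""
    t, h, w, c = frames.shape
    assert h % out_hw == 0 and w % out_hw == 0, "grid must divide evenly"
    kh, kw = h // out_hw, w // out_hw
    # Split each spatial axis into (output cell, position inside the cell),
    # then average over the positions inside each cell.
    return frames.reshape(t, out_hw, kh, out_hw, kw, c).mean(axis=(2, 4))


def slowfast_tokens(features: np.ndarray, slow_stride: int,
                    slow_hw: int, fast_hw: int) -> np.ndarray:
    """Build one token sequence from a slow and a fast view of the same video.

    features: (T, H, W, C) patch features, one grid per sampled frame.
    """
    c = features.shape[-1]
    # Slow stream: every `slow_stride`-th frame, mild pooling (more tokens per frame).
    slow = spatial_avg_pool(features[::slow_stride], slow_hw)
    # Fast stream: every frame, aggressive pooling (fewer tokens per frame).
    fast = spatial_avg_pool(features, fast_hw)
    # Assumed layout: slow tokens first, then fast tokens.
    return np.concatenate([slow.reshape(-1, c), fast.reshape(-1, c)], axis=0)


# Random stand-in for encoder output: 32 frames, 12x12 patches, 64 channels.
rng = np.random.default_rng(0)
features = rng.standard_normal((32, 12, 12, 64)).astype(np.float32)

tokens = slowfast_tokens(features, slow_stride=8, slow_hw=6, fast_hw=2)
print(tokens.shape)
```

With these assumed settings, the slow stream contributes 4 frames of 36 tokens each and the fast stream contributes 32 frames of 4 tokens each, so the language model receives 272 visual tokens instead of the 4,608 that the unpooled 32 frames would require.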


Project Resources


Application Scenarios of SlowFast-LLaVA-1.5

  • Long-Video Content Understanding & Summarization: Automatically generates summaries of long videos so that users can quickly grasp the key content.

  • Video Q&A Systems: Users ask questions in natural language, and the model answers based on the content of the long video, making interaction more natural.

  • Video Editing & Creation: Automatically extracts key segments from long videos to produce short clips, improving content creation efficiency.

  • Video Surveillance & Analysis: Identifies abnormal events in surveillance footage in real time, such as crowd gatherings, to support intelligent monitoring.

  • Multimedia Content Recommendation: Recommends relevant long-video content based on users’ viewing history, increasing engagement and retention.
