What is Eagle 2.5?
Eagle 2.5 is a vision-language model developed by NVIDIA that focuses on long-context multimodal learning. Despite its compact size of just 8 billion parameters, it excels at processing high-resolution images and long video sequences, delivering performance comparable to much larger models such as Qwen 2.5-VL-72B and InternVL2.5-78B.
Eagle 2.5 employs two innovative training strategies: Information-First Sampling and Progressive Post-Training.

- Information-First Sampling optimizes visual details and preserves image integrity through Image-Area Preservation (IAP) and Automatic Degradation Sampling (ADS).
- Progressive Post-Training gradually expands the context window, ensuring stable performance across varying input lengths.
Key Features of Eagle 2.5
- Long Video & High-Resolution Image Understanding:
  - Handles long video sequences (e.g., 512 frames) and high-resolution images effectively.
  - Achieves 72.4% on the Video-MME benchmark, rivaling larger models.
- Diverse Task Support:
  - Video benchmarks: 74.8% (MVBench), 77.6% (MLVU), 66.4% (LongVideoBench).
  - Image benchmarks: 94.1% (DocVQA), 87.5% (ChartQA), 80.4% (InfoVQA).
- Flexibility & Generalization:
  - Combines SigLIP vision encoding with MLP projection layers for strong adaptability across tasks.
Technical Innovations
- Information-First Sampling (IFS)
  - Image-Area Preservation (IAP): retains over 60% of the original image area while minimizing aspect-ratio distortion.
  - Automatic Degradation Sampling (ADS): dynamically balances visual and text inputs based on context length, preserving text completeness and visual detail.
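The tiling decision behind IAP can be sketched as follows. This is an illustrative reconstruction, not Eagle 2.5's actual implementation: the 448-pixel tile size, the 12-tile budget, and scoring grids by log-aspect-ratio distance are all assumptions; only the "retain at least 60% of the original area" constraint comes from the description above.

```python
import math

def select_tile_grid(img_w, img_h, tile=448, max_tiles=12, min_area_frac=0.6):
    """Pick a (cols, rows) tiling grid for a high-resolution image.

    Hypothetical IAP-style sketch: among candidate grids within the tile
    budget, reject any grid that would retain less than `min_area_frac`
    of the source pixels after resizing, then pick the grid whose aspect
    ratio is closest to the image's (minimal log-ratio distortion).
    """
    img_area = img_w * img_h
    img_ratio = img_w / img_h
    best, best_dist = None, float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            grid_area = cols * rows * tile * tile
            # Fraction of source pixels kept when downsampling into the grid
            # (upsampling keeps everything, hence the cap at 1.0).
            if min(grid_area / img_area, 1.0) < min_area_frac:
                continue
            dist = abs(math.log((cols / rows) / img_ratio))
            if dist < best_dist:
                best, best_dist = (cols, rows), dist
    return best
```

For a 1920×1080 input this prefers a wide grid close to the 16:9 source ratio rather than a square one, which is the point of minimizing aspect-ratio distortion.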
- Progressive Post-Training
  - Expands the context window from 32K to 128K tokens, ensuring stable performance across varying input lengths and preventing overfitting.
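A minimal sketch of what such a staged schedule could look like. The three-stage split (32K → 64K → 128K), the helper names, and truncation-based packing are assumptions for illustration; the source only states that the window grows from 32K to 128K tokens.

```python
# Illustrative staged context schedule; the intermediate 64K stage is an
# assumption, the source only states growth from 32K to 128K tokens.
STAGES = [32_768, 65_536, 131_072]

def pack_to_window(token_ids, window):
    """Fit a token sequence into the active context window.

    Real training would re-sample frames and text to the budget (that is
    ADS's job); plain truncation stands in for it here.
    """
    return token_ids[:window]

def progressive_post_train(model, dataset, train_one_stage):
    """Run one training stage per context-window size, smallest first,
    so the model adapts gradually to longer inputs."""
    for window in STAGES:
        batch = [pack_to_window(sample, window) for sample in dataset]
        train_one_stage(model, batch, max_len=window)
```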
- Custom Dataset (Eagle-Video-110K)
  - Designed for long-video understanding with dual annotation methods:
    - Top-down: story-level segmentation with human-labeled chapter metadata.
    - Bottom-up: GPT-4o-generated QA pairs for short clips.
  - Uses cosine similarity filtering to ensure diversity and narrative coherence.
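Cosine-similarity filtering can be sketched as a greedy pass over clip embeddings. The 0.9 threshold and the keep-first greedy ordering are assumptions; the source only says cosine similarity is used to enforce diversity.

```python
import numpy as np

def filter_diverse(embeddings, threshold=0.9):
    """Keep an embedding only if its cosine similarity to every
    already-kept embedding stays below the threshold (greedy pass).

    Kept vectors are stored normalized, so the dot product below is
    exactly the cosine similarity.
    """
    kept = []
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        if all(float(e @ k) < threshold for k in kept):
            kept.append(e)
    return kept
```

Near-duplicate clips collapse to a single representative, while dissimilar ones survive, which is the diversity property the dataset filtering is after.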
- Vision Encoding & Projection
  - Integrates SigLIP vision encoding with MLP projection layers to align visual embeddings with the language model’s representation space.
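A NumPy sketch of such a projection layer. The two-layer shape with a GELU in between, the 1152-dim input (SigLIP-SO400M's width), and the 4096-dim output are illustrative assumptions; the source only states that an MLP maps SigLIP embeddings into the LLM's representation space.

```python
import numpy as np

class MLPProjector:
    """Two-layer MLP mapping vision-encoder tokens into the language
    model's hidden space. Dimensions and the GELU nonlinearity are
    assumptions, not Eagle 2.5's published configuration."""

    def __init__(self, vision_dim=1152, llm_dim=4096, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init stands in for trained weights.
        self.w1 = rng.standard_normal((vision_dim, llm_dim)) * 0.02
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    @staticmethod
    def _gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def __call__(self, vision_tokens):
        # vision_tokens: (batch, num_patches, vision_dim)
        h = self._gelu(vision_tokens @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # (batch, num_patches, llm_dim)
```

After projection, the visual tokens have the same width as the LLM's text embeddings and can be interleaved with them in the input sequence.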
Project Links
- Official Website: https://nvlabs.github.io/EAGLE/
- arXiv Paper: https://arxiv.org/pdf/2504.15271
Applications
- Smart Video Analysis: real-time video stream processing for surveillance (e.g., anomaly detection and alert generation).
- High-Resolution Image Processing: image classification, object detection, and caption generation.
- Content Creation & Marketing: generates high-quality image descriptions and video scripts for ads and social media.
- Education & Training: provides explanatory text for educational videos/images to aid learning.
- Autonomous Vehicles & Robotics: processes visual data from cameras and integrates text instructions for decision-making.