What is Eagle 2.5?
Eagle 2.5 is a vision-language model developed by NVIDIA that focuses on long-context multimodal learning. Despite its compact size of just 8 billion parameters, it excels at processing high-resolution images and long video sequences, delivering performance comparable to much larger models such as Qwen2.5-VL-72B and InternVL2.5-78B.
Eagle 2.5 employs two innovative training strategies: Information-First Sampling and Progressive Post-Training.

- Information-First Sampling optimizes visual details and preserves image integrity through Image-Area Preservation (IAP) and Automatic Degradation Sampling (ADS).
- Progressive Post-Training gradually expands the context window, ensuring stable performance across varying input lengths.

Key Features of Eagle 2.5
- Long Video & High-Resolution Image Understanding:
  - Handles long video sequences (e.g., 512 frames) and high-resolution images effectively.
  - Achieves 72.4% on the Video-MME benchmark, rivaling larger models.
- Diverse Task Support:
  - Video benchmarks: scores 74.8% (MVBench), 77.6% (MLVU), and 66.4% (LongVideoBench).
  - Image benchmarks: scores 94.1% (DocVQA), 87.5% (ChartQA), and 80.4% (InfoVQA).
- Flexibility & Generalization:
  - Combines SigLIP vision encoding and MLP projection layers for strong adaptability across tasks.
 
Technical Innovations
- Information-First Sampling (IFS)
  - Image-Area Preservation (IAP): retains over 60% of the original image area while minimizing aspect-ratio distortion.
  - Automatic Degradation Sampling (ADS): dynamically balances visual and text inputs based on context length, preserving text completeness and visual details. Both mechanisms are sketched in the example below.
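
A minimal sketch of both mechanisms in Python, under stated assumptions: the function names, the 448-pixel tile size, and the per-frame token cost are illustrative, not Eagle 2.5's actual implementation. IAP is modeled as a grid search that rejects tilings which would shrink the image below 60% of its original area; ADS is modeled as a budget split that keeps the text intact and spends the remaining context on video frames.

```python
import math

TILE = 448             # assumed tile resolution (illustrative)
MIN_AREA_RATIO = 0.6   # IAP: retain at least 60% of the original image area

def best_tile_grid(width, height, max_tiles=12):
    """IAP-style search: pick the (cols, rows) grid that minimizes
    aspect-ratio distortion while retaining >= 60% of the image area."""
    src_ratio = width / height
    best, best_dist = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            grid_w, grid_h = cols * TILE, rows * TILE
            scale = min(grid_w / width, grid_h / height, 1.0)  # never upscale
            if scale ** 2 < MIN_AREA_RATIO:
                continue  # this grid would degrade the image too much
            dist = abs(math.log((grid_w / grid_h) / src_ratio))
            if dist < best_dist:
                best, best_dist = (cols, rows), dist
    return best

def allocate_frames(context_len, text_tokens, n_frames, tokens_per_frame=256):
    """ADS-style allocation: text is preserved in full; the leftover token
    budget determines how many frames survive (uniformly subsampled)."""
    visual_budget = max(context_len - text_tokens, 0)
    max_frames = visual_budget // tokens_per_frame
    if n_frames <= max_frames:
        return list(range(n_frames))        # everything fits, keep all frames
    if max_frames == 0:
        return []
    stride = n_frames / max_frames          # uniform temporal subsampling
    return [int(i * stride) for i in range(max_frames)]
```
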
- Progressive Post-Training
  - Expands the context window from 32K to 128K tokens, ensuring stable performance across varying input lengths and preventing overfitting to any single length (a schedule sketch follows).
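
A minimal sketch of such a schedule, with loudly labeled assumptions: only the 32K and 128K endpoints come from the description above; the intermediate 64K stage, the step counts, and the `data.sample` / `model.train_step` calls are placeholders, not the published recipe.

```python
import random

# 32K and 128K come from the description above; the intermediate
# 64K stage is an assumed interpolation.
STAGES = [32_768, 65_536, 131_072]

def sample_length(stage_max, floor=4_096):
    """Mix shorter sequences into every stage so the model stays stable
    across input lengths instead of overfitting to the current maximum."""
    return random.randint(floor, stage_max)

def progressive_post_training(model, data, steps_per_stage=1_000):
    for stage_max in STAGES:                         # gradually widen the window
        for _ in range(steps_per_stage):
            seq_len = sample_length(stage_max)
            batch = data.sample(max_tokens=seq_len)  # placeholder data API
            model.train_step(batch)                  # placeholder trainer API
```
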
- Custom Dataset (Eagle-Video-110K)
  - Designed for long-video understanding with dual annotation methods:
    - Top-down: story-level segmentation with human-labeled chapter metadata.
    - Bottom-up: GPT-4o-generated QA pairs for short clips.
  - Uses cosine-similarity filtering to ensure diversity and narrative coherence (a filtering sketch follows).
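
A minimal sketch of such a filter, assuming clip embeddings have already been computed; the greedy strategy and the 0.9 threshold are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def diversity_filter(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep an item only if its maximum cosine similarity to the
    already-kept items stays below the threshold, enforcing diversity."""
    # L2-normalize rows so dot products equal cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or float((normed[kept] @ vec).max()) < threshold:
            kept.append(i)
    return kept

# e.g., keep a diverse subset of 1,000 clip embeddings of dimension 512:
# indices = diversity_filter(np.random.randn(1000, 512))
```
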
- Vision Encoding & Projection
  - Integrates SigLIP vision encoding and MLP projection layers to align visual embeddings with the language model's representation space, as sketched below.
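
A minimal sketch of this connector in PyTorch; the dimensions (a SigLIP hidden size of 1152 and an LLM hidden size of 4096), the two-layer depth, and the GELU activation are illustrative assumptions, not the model's published configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the language model's
    representation space (dimensions are illustrative assumptions)."""
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_embeds):      # (batch, num_patches, vision_dim)
        return self.proj(vision_embeds)    # (batch, num_patches, llm_dim)

# Patch embeddings from a SigLIP encoder would be projected like this:
projector = MLPProjector()
patches = torch.randn(1, 256, 1152)   # placeholder for SigLIP output
llm_tokens = projector(patches)       # ready to interleave with text tokens
```
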
 
Project Links
- Official Website: https://nvlabs.github.io/EAGLE/
- arXiv Paper: https://arxiv.org/pdf/2504.15271
 
Applications
- Smart Video Analysis: real-time video-stream processing for surveillance (e.g., anomaly detection and alert generation).
- High-Resolution Image Processing: image classification, object detection, and caption generation.
- Content Creation & Marketing: generates high-quality image descriptions and video scripts for ads and social media.
- Education & Training: provides explanatory text for educational videos and images to aid learning.
- Autonomous Vehicles & Robotics: processes visual data from cameras and integrates text instructions for decision-making.