Deep Video Discovery – Microsoft’s AI Agent for In-depth Video Exploration

What is Deep Video Discovery?

Deep Video Discovery (DVD) is an agent developed by Microsoft for understanding and analyzing long-form videos. It segments a long video into short clips and uses the reasoning capabilities of large language models (LLMs) to plan autonomously, selecting the tools and parameters it needs to collect information. DVD is equipped with three search tools (global browsing, clip search, and frame inspection) that extract information at different levels of granularity. Through iterative reasoning, it gradually builds a comprehensive understanding of the video content. The system has achieved state-of-the-art performance on several long-video comprehension benchmarks, improving both the accuracy and efficiency of long-form video analysis.

Key Features of Deep Video Discovery

Multi-granular Video Understanding: Analyzes video content at three levels—global, clip, and frame—to provide comprehensive understanding.
Autonomous Search and Reasoning: Independently plans and executes search strategies, dynamically choosing tools and parameters based on user queries to iteratively gather information.
Efficient Information Retrieval: Rapidly identifies and extracts relevant video segments and details using tools like global browsing, clip search, and frame inspection.
Long Video Comprehension: Specializes in handling information-dense videos lasting several hours, effectively addressing temporal and spatial complexities.
Flexible Tool Usage: Adapts tool combinations based on task requirements, enabling efficient video content analysis and accurate question answering.

Technical Principles Behind Deep Video Discovery

Multi-granular Video Database Construction: Long videos are evenly segmented into multiple short clips (approximately 5 seconds each). Information is extracted at three levels:
- Global Level: Summarizes main topics and events.
- Clip Level: Provides captions for each segment.
- Frame Level: Retains raw pixel data.
  This data forms a structured database containing decoded frames, text captions, and embedding vectors to support fast retrieval and detailed analysis.
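
The segmentation and three-level layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DVD's actual code: the names `Clip`, `VideoDatabase`, and `segment_video` are hypothetical, captions and embeddings are left empty where a captioning model and an embedding model would fill them in, and frame storage is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """One short segment of the video (hypothetical structure)."""
    index: int
    start_s: float
    end_s: float
    caption: str = ""                              # clip-level text, filled by a captioning model
    embedding: list = field(default_factory=list)  # vector for the caption, filled by an embedding model


@dataclass
class VideoDatabase:
    """Global summary plus per-clip records (hypothetical structure)."""
    global_summary: str = ""
    clips: list = field(default_factory=list)

    def clips_in_range(self, start_s, end_s):
        # Return every clip that overlaps the time window [start_s, end_s).
        return [c for c in self.clips if c.end_s > start_s and c.start_s < end_s]


def segment_video(duration_s, clip_len_s=5.0):
    """Split a video of the given duration into evenly sized clips."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    clips = []
    i = 0
    while i * clip_len_s < duration_s:
        start = i * clip_len_s
        # The last clip is shortened so it never runs past the end of the video.
        clips.append(Clip(i, start, min(start + clip_len_s, duration_s)))
        i += 1
    return clips


db = VideoDatabase(clips=segment_video(12.0))
print([(c.start_s, c.end_s) for c in db.clips])
# prints [(0.0, 5.0), (5.0, 10.0), (10.0, 12.0)]
```

Keeping start and end times on every clip is what lets a later frame-level query map a time range back to the clips, and therefore the frames, it covers.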
Autonomous Search and Answer Generation:
- Global Browse: Offers a high-level summary to help the agent quickly understand the main topics and events.
- Clip Search: Uses text embedding matching to rapidly find clips relevant to user queries.
- Frame Inspect: Conducts fine-grained visual question answering (VQA) within a specified time range to extract detailed frame-level information.
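
To make the embedding-matching idea behind Clip Search concrete, here is a minimal sketch that ranks clip captions by cosine similarity to a query. It rests on a loud assumption: the `embed` function below is a toy bag-of-words stand-in, whereas a real system would call a learned text embedding model. The function names and the sample captions are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clip_search(query, captions, top_k=2):
    """Return (clip_index, score) for the top_k captions most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(captions)]
    # Highest score first; ties are broken by the earlier clip index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:top_k]]


captions = [
    "a man opens the door",
    "a dog runs across the field",
    "the man closes the door and leaves",
]
for index, score in clip_search("dog in the field", captions):
    print(index, round(score, 3))
```

With these sample captions the dog clip (index 1) ranks first. In practice the caption embeddings would be computed once when the database is built, so that only the query needs to be embedded at search time.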
Autonomous Agent Design:
The agent operates on an observe-reason-act loop. Leveraging LLMs’ reasoning ability, it dynamically selects and utilizes tools to progressively build its understanding.
Iterative Reasoning:
The agent continuously updates its search strategy based on current observations and reasoning results, refining queries to produce accurate answers.
LLM-driven Reasoning:
The LLM serves as the core planner and reasoner, selecting tools and parameters based on conversation history and current observations. It dynamically adjusts strategies and constructs multi-step tool-use chains to handle complex queries.
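
The observe-reason-act loop can be outlined as follows. This is a sketch under explicit assumptions rather than the published implementation: the three tools are stubs returning canned strings, and `stub_planner` is a fixed coarse-to-fine policy standing in for the LLM, which in the real system chooses tools and parameters freely from the conversation history. All names and sample outputs here are hypothetical.

```python
# Hypothetical tool stubs; real tools would query the video database.
def global_browse(args):
    return "Summary: a cooking tutorial about pasta."


def clip_search(args):
    return "Clip 12 (60s-65s): the chef adds salt to boiling water."


def frame_inspect(args):
    return "Frames 60s-65s: the chef adds two spoons of salt."


TOOLS = {
    "global_browse": global_browse,
    "clip_search": clip_search,
    "frame_inspect": frame_inspect,
}


def stub_planner(question, history):
    """Stand-in for the LLM: picks the next action from the history so far."""
    step = len(history)
    if step == 0:
        return {"tool": "global_browse", "args": {}}
    if step == 1:
        return {"tool": "clip_search", "args": {"query": question}}
    if step == 2:
        return {"tool": "frame_inspect",
                "args": {"start_s": 60, "end_s": 65, "question": question}}
    # Enough evidence gathered: answer with the latest observation.
    return {"tool": "answer", "args": {"text": history[-1][1]}}


def run_agent(question, planner, max_steps=8):
    """Run the loop until the planner answers or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action = planner(question, history)                   # reason
        if action["tool"] == "answer":
            return action["args"]["text"], history
        observation = TOOLS[action["tool"]](action["args"])   # act
        history.append((action["tool"], observation))         # observe
    return "No answer within the step budget.", history


answer, trace = run_agent("How much salt does the chef add?", stub_planner)
print([tool for tool, _ in trace])
print(answer)
```

The step budget (`max_steps`) is a design choice of this sketch: it guarantees termination even if the planner never decides to answer. Swapping `stub_planner` for a function that prompts an LLM with the question and history would leave the loop itself unchanged.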

Project Link

arXiv Technical Paper

Application Scenarios of Deep Video Discovery

Education: Enables online learning platforms to analyze long video lectures and help students quickly locate specific knowledge points or chapters.
Sports Analytics: Analyzes game footage to rapidly extract key events in sports matches.
Video Surveillance: Supports real-time analysis of surveillance footage, enabling fast detection of abnormal behaviors or events.
Film Production: Assists post-production teams in analyzing raw footage to quickly locate desired scenes.
Corporate Meetings: Helps businesses analyze recorded meetings and extract key takeaways and decisions efficiently.