VitaBench: A Benchmark for Evaluating LLM Agents, Developed by Meituan

What is VitaBench?

VitaBench is an evaluation benchmark for large language model (LLM) agents, developed by Meituan’s LongCat team. It assesses how agents perform on complex, real-world problems in everyday scenarios such as food delivery, dining, and travel. VitaBench provides an interactive evaluation environment with 66 tools and cross-scenario tasks that measure an agent’s abilities in deep reasoning, tool use, and user interaction. It is presented as the first benchmark to quantitatively decompose agent task complexity, construct a large-scale real-world environment database, introduce realistic user simulators, and adopt atomic evaluation rubrics for fine-grained behavioral assessment.

Main Features of VitaBench

1. Construction of Complex Task Environments:
VitaBench is built on high-frequency everyday scenarios: food delivery, dining, and travel. Its interactive evaluation environment exposes 66 tools, and its cross-scenario tasks combine these scenarios to simulate complex real-world demands.

2. Quantification of Task Complexity Dimensions:
VitaBench measures task complexity along three dimensions: deep reasoning, tool use, and user interaction. Reasoning complexity is evaluated with indicators such as observation space size, partial observability, and the number of reasoning steps. Tool complexity is distinguished by whether a task is single-scenario or cross-scenario. Interaction complexity is measured with realistic user simulators.
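To make the three dimensions concrete, the indicators named above could be recorded per task roughly as follows. This is a minimal sketch under stated assumptions, not VitaBench's actual schema: the class name, the field names, and the use of a user-turn count as a stand-in for interaction complexity are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaskComplexity:
    """Hypothetical record of one task's complexity indicators."""

    # Reasoning complexity indicators (names are illustrative assumptions)
    observation_space_size: int   # how many entities the agent could inspect
    partially_observable: bool    # True if key facts are hidden until queried
    reasoning_steps: int          # estimated inference steps needed

    # Tool complexity: which scenarios the task touches
    scenarios: Tuple[str, ...]    # e.g. ("delivery",) or ("delivery", "travel")

    # Interaction complexity proxy (an assumption, not the benchmark's metric)
    user_turns: int

    @property
    def cross_scenario(self) -> bool:
        """A task is cross-scenario when it spans more than one distinct scenario."""
        return len(set(self.scenarios)) > 1


single = TaskComplexity(120, False, 3, ("delivery",), 2)
combined = TaskComplexity(500, True, 6, ("delivery", "travel"), 4)
print(single.cross_scenario, combined.cross_scenario)
```

Keeping the record immutable (`frozen=True`) makes it safe to use as a key when grouping results by complexity profile.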

3. Fine-Grained Evaluation:
Inspired by recent research, VitaBench decomposes each task objective into a set of atomic evaluation rubrics. The evaluator scans the complete dialogue trajectory with overlapping sliding windows and applies a strict “all-or-nothing” criterion: a task counts as solved only if every rubric is satisfied. This gives comprehensive, fine-grained coverage of agent behavior.
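The scoring idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not VitaBench's implementation: each rubric is modeled here as a simple predicate over a window of turns (a hypothetical stand-in for whatever judge the benchmark uses), and the window size and stride are made-up values.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a trajectory is a list of turn strings, and a rubric is
# a predicate that checks one atomic requirement against a window of turns.
Rubric = Callable[[Sequence[str]], bool]


def sliding_windows(turns: Sequence[str], size: int, stride: int) -> List[Sequence[str]]:
    """Split a trajectory into windows; stride < size makes them overlap."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(turns) <= size:
        return [turns]
    windows = [turns[i:i + size] for i in range(0, len(turns) - size + 1, stride)]
    # Add a final window if the stride would otherwise skip the last turns.
    if (len(turns) - size) % stride != 0:
        windows.append(turns[-size:])
    return windows


def evaluate(turns: Sequence[str], rubrics: Sequence[Rubric],
             size: int = 4, stride: int = 2) -> bool:
    """All-or-nothing: succeed only if every rubric holds in at least one window."""
    windows = sliding_windows(turns, size, stride)
    return all(any(rubric(w) for w in windows) for rubric in rubrics)


def contains(keyword: str) -> Rubric:
    """Toy rubric: some turn in the window mentions the keyword."""
    return lambda window: any(keyword in turn for turn in window)


trajectory = [
    "user: vegetarian noodles under 50 yuan, and a table for tonight",
    "agent: searching nearby shops",
    "tool: found 3 matching shops",
    "agent: ordered veggie noodles for 42 yuan",
    "user: thanks, and the table?",
    "agent: booked a table at 19:00",
    "user: great, bye",
]
passed = evaluate(trajectory, [contains("veggie noodles"), contains("19:00")])
failed = evaluate(trajectory, [contains("veggie noodles"), contains("taxi")])
```

Because the criterion is all-or-nothing, a single unmet rubric (here, the hypothetical "taxi" check) fails the whole task even when the others are satisfied.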

4. Open-Source Resources:
VitaBench is open source: its project homepage, research paper, code repository, and dataset are all publicly available. These resources give researchers and developers a common basis for studying and deploying agents in real-world settings.

Technical Principles of VitaBench

1. Multi-Dimensional Complexity Construction:
VitaBench models complex tasks along three key dimensions (deep reasoning, tool use, and user interaction) to reflect the intricacies of real-life scenarios.

2. Real-World Environment Database:
It builds a large-scale real-world environment database and offers partially observable environments, which test an agent’s reasoning under uncertainty.

3. User Simulator:
A realistic user simulator models diverse user behaviors and preferences, so agents must adapt to multi-turn interactions with varied user types.

4. Atomic Evaluation Rubrics:
Task goals are decomposed into atomic rubrics, and dialogue trajectories are analyzed with a sliding-window mechanism for fine-grained behavioral evaluation.

5. Cross-Scenario Task Design:
VitaBench includes cross-scenario integrated tasks that evaluate an agent’s ability to switch between contexts and integrate information across them.
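
As an illustration of how a user simulator combines with partial observability, the sketch below shows a user whose preferences stay hidden until the agent asks about them. It is a deliberately simple rule-based stand-in with hypothetical names (`ScriptedUser`, `hidden_preferences`); it makes no claim about how VitaBench's own simulator is built.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ScriptedUser:
    """Rule-based stand-in for a user simulator (hypothetical, for illustration)."""

    opening_request: str
    hidden_preferences: Dict[str, str]  # topic -> answer, unknown to the agent
    revealed: Set[str] = field(default_factory=set)

    def first_message(self) -> str:
        """The only information the agent gets for free."""
        return self.opening_request

    def reply(self, agent_message: str) -> str:
        """Reveal a preference only if the agent's message mentions its topic."""
        text = agent_message.lower()
        for topic, answer in self.hidden_preferences.items():
            if topic in text and topic not in self.revealed:
                self.revealed.add(topic)
                return answer
        return "I'm not sure what you mean, could you be more specific?"

    def fully_elicited(self) -> bool:
        """True once the agent has uncovered every hidden preference."""
        return self.revealed == set(self.hidden_preferences)


user = ScriptedUser(
    opening_request="Please order dinner for me.",
    hidden_preferences={
        "budget": "Under 60 yuan.",
        "spicy": "No spicy food, please.",
    },
)
first = user.reply("What is your budget?")
second = user.reply("Do you like spicy dishes?")
```

An agent that places an order before asking about budget or spice level acts on incomplete information, which is the kind of behavior a partially observable setup is meant to expose.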

Project Resources

Official Website: https://vitabench.github.io
GitHub Repository: https://github.com/meituan-longcat/vitabench
arXiv Paper: https://arxiv.org/abs/2509.26490
HuggingFace Dataset: https://huggingface.co/datasets/meituan-longcat/VitaBench

Application Scenarios

1. Food Delivery:
Simulates complex ordering scenarios involving user preferences, budgets, and time constraints. Evaluates an agent’s ability to understand user needs, recommend suitable options, and complete orders through multi-turn dialogues.

2. Dining:
Covers the entire dining process, from restaurant search and table reservation to ordering and payment. Tests the agent’s reasoning and tool use, such as recommending suitable restaurants and handling reservations and menu queries.

3. Travel and Transportation:
Includes trip planning, transportation booking, and attraction recommendations. Evaluates the agent’s ability to integrate multiple tools and data sources to generate personalized travel plans across different scenarios.

4. Intelligent Agent Development and Evaluation:
Provides a standardized benchmark for researchers and developers to assess and optimize intelligent agents’ performance on complex tasks, promoting innovation and practical deployment.

5. Human-Agent Interaction Research:
With realistic user simulators and multi-turn dialogue tasks, VitaBench supports research on interaction patterns between humans and agents, which can inform improvements in natural language understanding and dialogue management.