UNO-Bench – An Omni-Modal Large Model Evaluation Benchmark Released by Meituan LongCat


What is UNO-Bench?

UNO-Bench is an omni-modal large model evaluation benchmark introduced by Meituan’s LongCat team. Designed to address the gaps in existing evaluation systems, UNO-Bench uses high-quality, diverse data to accurately measure both single-modal and omni-modal capabilities. The benchmark is the first to verify the “composition law” of omni-modal large models, revealing the complex interplay between single-modal and omni-modal abilities. With its innovative multi-step open-ended questions and an efficient data-compression algorithm, UNO-Bench improves both evaluation granularity and efficiency, providing a scientific assessment tool that advances the development of omni-modal foundation models.



Main Features of UNO-Bench

Accurate capability assessment:
Uses high-quality, diverse datasets to evaluate model performance across single-modal and omni-modal tasks involving images, audio, video, and text.

Revealing capability composition laws:
Verifies the “composition law” of omni-modal large models for the first time, exposing the intricate relationships between single-modal and omni-modal abilities and offering theoretical support for model optimization.

Innovative evaluation methodology:
Introduces Multi-step Open-ended (MO) questions to assess how well models maintain reasoning depth across complex tasks, effectively distinguishing differences in reasoning capabilities.

Efficient data management:
Adopts a clustering-guided hierarchical sampling method to greatly reduce evaluation costs while maintaining highly consistent model rankings.

Support for multimodal fusion research:
Provides a unified evaluation framework that helps researchers explore multimodal fusion and paves the way for more powerful omni-modal models in the future.


Technical Principles Behind UNO-Bench

Unified capability framework:
Model capabilities are decomposed into a perception layer and a reasoning layer.

  • The perception layer covers fundamental skills such as basic recognition and cross-modal alignment.

  • The reasoning layer includes advanced tasks such as spatial and temporal reasoning.

This dual-dimensional framework provides a clear blueprint for data construction and model evaluation.
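
To make the framework concrete, here is a minimal sketch of how such a two-layer taxonomy might be encoded for tagging benchmark questions. The schema and field names are illustrative assumptions, not UNO-Bench’s actual data format; only the layer and skill names mentioned above come from the article.

```python
# Hypothetical encoding of the two-layer capability taxonomy.
from dataclasses import dataclass

CAPABILITY_TAXONOMY = {
    "perception": ["basic_recognition", "cross_modal_alignment"],
    "reasoning": ["spatial_reasoning", "temporal_reasoning"],
}

@dataclass
class Question:
    qid: str
    modalities: list[str]   # e.g. ["image", "audio"]
    layer: str              # "perception" or "reasoning"
    skill: str              # one of the taxonomy entries above

def validate(q: Question) -> bool:
    """Check that a question's capability tag exists in the taxonomy."""
    return q.skill in CAPABILITY_TAXONOMY.get(q.layer, [])
```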

High-quality data construction:

  • Data collection & annotation: Data is manually labeled with multi-round quality checks to ensure high quality and diversity. Over 90% of the dataset is private and original, preventing data contamination.

  • Cross-modal solvability: Ablation studies ensure that more than 98% of questions genuinely require multimodal information, avoiding redundancy from single-modal cues.
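
A hedged sketch of what such a solvability ablation could look like: a question is kept only if no single modality alone yields the correct answer while the full modality set does. The `ask_model` callable and the exact-match criterion are assumptions for illustration, not UNO-Bench’s actual pipeline.

```python
def is_cross_modal(question, inputs, ask_model, gold):
    """inputs: dict mapping modality name -> content,
    e.g. {"image": img, "audio": wav}.
    ask_model(question, inputs) -> the model's answer string."""
    # Ablation: try each modality alone; if any one suffices, the
    # question is redundant and should be rejected.
    for modality, content in inputs.items():
        if ask_model(question, {modality: content}) == gold:
            return False
    # The question must still be answerable with all modalities present.
    return ask_model(question, inputs) == gold
```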

  • Audio-visual recombination: Audio content is designed independently and manually paired with visual materials to break information redundancy, compelling models to perform true cross-modal fusion.

  • Data optimization & compression: A clustering-guided hierarchical sampling method selects representative samples from large-scale data, reducing evaluation costs while preserving ranking consistency.
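
As a rough illustration of clustering-guided hierarchical sampling, the sketch below clusters question embeddings with scikit-learn k-means and then samples from each cluster in proportion to its size. It assumes embeddings are already available; UNO-Bench’s actual clustering and allocation scheme may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress(embeddings: np.ndarray, n_clusters: int, budget: int, seed=0):
    """Pick up to `budget` representative question indices: cluster the
    embeddings, then sample from each cluster proportionally to its size."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    chosen = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        # Size-proportional allocation, at least one sample per cluster.
        k = max(1, round(budget * len(idx) / len(embeddings)))
        chosen.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return sorted(int(i) for i in chosen[:budget])
```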

Innovative evaluation methods:
Complex reasoning tasks are decomposed into multiple sub-questions requiring open-ended text answers. Expert-weighted scoring provides precise measurement of reasoning capabilities. Through fine-grained question type categorization and iterative multi-round annotation, the system automatically scores multiple types of questions with an accuracy of up to 95%.
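
As an illustration, expert-weighted scoring over an MO question’s sub-answers might look like the sketch below. The `judge` grader (returning 0 or 1 per sub-answer, e.g. a rule-based or LLM-based matcher) and the weighting scheme are assumptions, not the benchmark’s published scorer.

```python
def score_mo_question(sub_answers, references, weights, judge):
    """Each reasoning step is a sub-question; the final score is a
    weighted average of per-step correctness, in [0, 1]."""
    assert len(sub_answers) == len(references) == len(weights)
    earned = sum(w * judge(ans, ref)
                 for ans, ref, w in zip(sub_answers, references, weights))
    return earned / sum(weights)
```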

Verification of the composition law:
Regression analysis and ablation experiments show that omni-modal performance is not a simple linear sum of single-modal abilities. Instead, it follows a power-law synergistic pattern. This nonlinear relationship provides a new analytical framework for evaluating fusion efficiency.
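
As a rough illustration of testing such a nonlinear hypothesis, one could fit a multiplicative power-law model in log space, as sketched below. The functional form omni ≈ k · vis^a · aud^b is an illustrative stand-in for the composition law described above, not the paper’s exact model.

```python
import numpy as np

def fit_power_law(vis, aud, omni):
    """vis, aud, omni: 1-D arrays of per-model scores (all > 0).
    Returns (k, a, b) such that omni ≈ k * vis**a * aud**b."""
    # Taking logs turns the power law into a linear regression:
    # log(omni) = log(k) + a*log(vis) + b*log(aud)
    X = np.column_stack([np.ones_like(vis), np.log(vis), np.log(aud)])
    coef, *_ = np.linalg.lstsq(X, np.log(omni), rcond=None)
    logk, a, b = coef
    return np.exp(logk), a, b
```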



Application Scenarios of UNO-Bench

Model development and optimization:
Provides standard evaluation tools to help developers refine model architectures and improve multimodal fusion capabilities.

Industry application evaluation:
Useful in fields such as intelligent customer service and autonomous driving, where it assesses model performance in multimodal interaction scenarios and helps improve user experience.

Academic research and competitions:
Serves as a unified benchmark for academic analysis, model comparison, and multimodal competitions, driving technological breakthroughs.

Product development and market evaluation:
Helps companies assess product functionality and competitive positioning, offering scientific support for multimodal product development.

Cross-modal application development:
Supports domains such as multimedia content creation and intelligent security, enhancing performance and reliability in multimodal applications.
