OlympicArena – A Multidisciplinary Cognitive Reasoning Benchmark Jointly Launched by Shanghai Jiao Tong University and Shanghai AI Lab

What is OlympicArena?

OlympicArena is a multidisciplinary cognitive reasoning benchmark jointly launched by Shanghai Jiao Tong University and its Generative AI Research Lab (GAIR Lab), Shanghai AI Lab, and Soochow University. It consists of 11,163 bilingual questions sourced from international Olympic competitions, covering seven major fields: mathematics, physics, chemistry, biology, geography, astronomy, and computer science. OlympicArena evaluates the advanced cognitive reasoning capabilities of AI models, particularly logical reasoning and visual reasoning. Through fine-grained evaluation at both the answer level and the process level, it reveals the limitations of AI models in solving complex problems, with the stated goal of advancing AI technology toward superintelligence.

Main features of OlympicArena

Comprehensive Coverage: Spans seven core disciplines (mathematics, physics, chemistry, biology, geography, astronomy, and computer science) with a total of 34 subfields, evaluating the cognitive reasoning abilities of AI models across multiple disciplines.
Bilingual Support: The benchmark is available in both Chinese and English, which broadens its international applicability.
Answer-Level Evaluation: Checks whether the final answer produced by an AI model is correct.
Process-Level Evaluation: Assesses each step of the problem-solving process to check that the AI model’s reasoning is logically sound and correct.
Multimodal Support: Supports questions that integrate text and images, evaluating the AI model’s ability to process multimodal information.
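The difference between answer-level and process-level evaluation can be illustrated with a minimal Python sketch. It is not the benchmark's actual scoring code: the function names, the numeric tolerance, and the idea of scoring the process as the fraction of correct steps are all assumptions made for illustration.

```python
import re
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivially different strings match."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def answer_level_correct(predicted: str, reference: str, tol: float = 1e-6) -> bool:
    """Rule-based check of the final answer (hypothetical helper).

    Compares numerically when both values parse as floats,
    otherwise falls back to a normalized string comparison.
    """
    try:
        p, r = float(predicted), float(reference)
        return abs(p - r) <= tol * max(1.0, abs(r))
    except ValueError:
        return normalize(predicted) == normalize(reference)


def process_level_score(step_judgements: Sequence[bool]) -> float:
    """Fraction of reasoning steps judged correct (hypothetical helper).

    The per-step judgements are assumed to come from a separate grader.
    """
    if not step_judgements:
        return 0.0
    return sum(step_judgements) / len(step_judgements)
```

A model can reach the right final answer through flawed reasoning, or reason mostly correctly and slip at the last step, so the two scores are reported separately: `answer_level_correct("3.50", "3.5")` looks only at the result, while `process_level_score([True, True, False, True])` summarizes how many steps held up.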

Technical principles of OlympicArena

Data Collection and Annotation: Questions are collected from 62 international Olympic competitions to ensure high quality and diversity. A professional team extracts and annotates the questions, including question classification, answer-type annotation, and solution-step annotation. A multi-step verification mechanism checks the accuracy and consistency of the annotated data.
Evaluation Method: For questions with fixed answers, the model’s output is verified by rule-based matching. For questions requiring code generation, the generated code is validated against test cases. For process-level evaluation, the problem-solving steps generated by the model are compared with the reference solution steps to judge the correctness of each step. For questions that are difficult to assess with rule matching, a high-performance model (such as GPT-4V) acts as the evaluator that determines whether the model’s output is correct.
Multimodal Processing: For questions containing images, key information is extracted from the images using image recognition technology and combined with the textual information to evaluate the AI model’s multimodal processing ability. Descriptive text is also generated for the images to help the AI model better understand their content.
Data Leakage Detection: N-gram prediction is used to determine whether a model has already encountered the benchmark’s problems, which helps keep the evaluation fair. Detection is performed at the instance level: for each problem, it checks whether the model correctly predicts key information in the problem text.
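The N-gram prediction idea from the Data Leakage Detection item can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's implementation: the `predict_next` callable is a hypothetical stand-in for a real model, tokens are split on whitespace, and the choices of 5-grams, five evenly spaced starting points, and an "all predictions match" flagging rule are assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: given a token prefix and n, return the next n tokens.
Predictor = Callable[[Sequence[str], int], List[str]]


def ngram_prediction_accuracy(
    text: str, predict_next: Predictor, n: int = 5, num_starts: int = 5
) -> float:
    """Share of sampled positions where the model reproduces the next n-gram exactly."""
    tokens = text.split()  # whitespace tokenization is an assumption
    if len(tokens) <= n:
        return 0.0  # too short to hold a non-empty prefix plus a full n-gram
    last_start = len(tokens) - n
    # Evenly spaced starting points, each with a non-empty prefix and a full n-gram after it.
    starts = sorted(
        {max(1, (i + 1) * last_start // num_starts) for i in range(num_starts)}
    )
    hits = 0
    for s in starts:
        if list(predict_next(tokens[:s], n)) == tokens[s:s + n]:
            hits += 1
    return hits / len(starts)


def looks_leaked(text: str, predict_next: Predictor, n: int = 5) -> bool:
    """Instance-level flag: every sampled n-gram was predicted exactly (assumed rule)."""
    return ngram_prediction_accuracy(text, predict_next, n) == 1.0
```

The intuition is that a model that has never seen a problem is unlikely to reproduce several exact continuations of its wording, whereas a model that memorized the problem during training often can. A flagged instance is a signal worth inspecting, not proof of leakage.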

Project links for OlympicArena

Project Website: https://gair-nlp.github.io/OlympicArena/
GitHub Repository: https://github.com/GAIR-NLP/OlympicArena
Hugging Face Dataset: https://huggingface.co/datasets/GAIR/OlympicArena
arXiv Technical Paper: https://arxiv.org/pdf/2406.12753

Application scenarios of OlympicArena

AI Model Performance Evaluation: Test the cognitive reasoning ability of AI models in multidisciplinary fields.
Model Training and Optimization: Help identify model weaknesses and guide improvements in training strategies.
Education and Learning Assistance: Provide competition-level learning resources to support teaching.
Scientific Research and Discovery: Promote the application of AI in scientific research and facilitate scientific discoveries.
Technology Competitions and Challenges: Serve as an AI technology competition platform to inspire innovation and promote technological development.