BrowseComp – An Open-Source Benchmark from OpenAI for the Web Browsing Capabilities of AI Agents
What is BrowseComp?
BrowseComp is a benchmark open-sourced by OpenAI for evaluating the web browsing capabilities of AI agents. It contains 1,266 highly challenging questions covering domains such as movies, science and technology, art, history, sports, music, and video games. Each question requires an AI agent to search the internet and match candidate answers against complex constraints, such as identifying a specific football match or TV show character. In OpenAI's tests, GPT-4o and GPT-4.5 achieved extremely low accuracy, while the newly released agent model Deep Research reached a significantly higher 51.5%, showcasing its strengths in autonomous search, information integration, and calibrated answering.
The main functions of BrowseComp
- Evaluation of Complex Information Retrieval Ability: BrowseComp's 1,266 questions, spanning movies, science and technology, art, history, sports, music, video games, and more, require AI agents to conduct deep searches across the vast space of the internet and match potential answers against the complex constraints stated in each question.
- Strict Control of Question Difficulty: To guarantee that the questions are genuinely hard, the data scientists who wrote them applied three checks: existing models (such as OpenAI's GPT-4o, GPT-4.5, and an earlier version of Deep Research) must fail to solve them; five simple Google searches must not surface the answer on the first page of results; and another data scientist must be unable to solve them within ten minutes.
- Reliable Answer Verification: Although the questions are hard, the answers are short and unambiguous, so a model's output can be checked easily against the reference answer (a minimal verification sketch follows this list). This design keeps the benchmark both challenging and fair.
- Promoting the Development of AI Browsing Agents: Open-sourcing BrowseComp gives researchers a new tool and a new direction for studying AI browsing agents, pushing the field toward smarter and more reliable agents.
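To make the verification point concrete, here is a minimal sketch of how a short model answer could be checked against a reference answer. It is only an illustration: the function names and the normalization rules are assumptions, and OpenAI's released evaluation code may grade answers differently (for example, with a model-based grader).

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for a fairer comparison."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(predicted: str, reference: str) -> bool:
    """Return True if the normalized prediction matches (or contains) the normalized reference."""
    pred, ref = normalize(predicted), normalize(reference)
    return pred == ref or ref in pred


if __name__ == "__main__":
    # Hypothetical example: BrowseComp answers are short, so checking them is simple.
    print(is_correct("The referee was John Smith.", "John Smith"))  # True
```

Because the reference answers are concise, even a grader this simple captures the spirit of the benchmark's "hard to find, easy to verify" design.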
The technical principles of BrowseComp
- Complex Question Design: BrowseComp's questions require AI agents to perform multi-step reasoning and retrieve information across many websites. They are designed to simulate real-world information-retrieval scenarios in which the relevant facts are hard to find and interrelated.
- Multi-source Information Integration: An agent must visit multiple websites and integrate information from different sources to arrive at an answer. A typical question might require checking sports-event records on one site and referee information on another before the correct answer can be pinned down.
- Reasoning and Search Strategies: Beyond simple retrieval, agents need strong reasoning ability to analyze and synthesize what they find. The Deep Research model does well on BrowseComp precisely because it autonomously adjusts its search strategy and optimizes its search path based on intermediate results.
- Dynamic Adaptability: An agent must react quickly to whatever it finds mid-search and adjust its strategy accordingly, which lets it home in on the target information more efficiently in a complex web environment (a minimal agent-loop sketch follows this list).
- Impact of Computational Resources: Test results show that increasing the compute spent per question significantly improves agent performance on these complex browsing tasks; more compute lets an agent explore more search paths, raising the probability of finding the correct answer (a simple best-of-N sketch follows the agent-loop example below).
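The third and fourth points above describe an agent that interleaves searching, reading, and reasoning, and revises its queries as it learns more. The following is a minimal sketch of such a loop, assuming hypothetical `web_search`, `fetch_page`, and `llm` helpers that are not part of BrowseComp or any OpenAI API; the actual Deep Research agent is far more sophisticated.

```python
from typing import Callable, List, Optional

# Hypothetical helpers (assumptions for illustration only):
#   web_search(query) -> list of result URLs
#   fetch_page(url)   -> page text
#   llm(prompt)       -> model completion as a string

def browse_and_answer(
    question: str,
    web_search: Callable[[str], List[str]],
    fetch_page: Callable[[str], str],
    llm: Callable[[str], str],
    max_steps: int = 10,
) -> Optional[str]:
    """Iteratively search, read, and reason until the agent commits to a short answer."""
    notes: List[str] = []
    query = question
    for _ in range(max_steps):
        for url in web_search(query)[:3]:         # read a few results per query
            notes.append(fetch_page(url)[:2000])  # keep only a snippet of each page
        decision = llm(
            "Question: " + question + "\n"
            "Notes so far:\n" + "\n---\n".join(notes) + "\n"
            "If the notes determine the answer, reply 'ANSWER: <short answer>'.\n"
            "Otherwise reply 'SEARCH: <a refined query>' to keep looking."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        if decision.startswith("SEARCH:"):
            query = decision[len("SEARCH:"):].strip()  # adapt the strategy to what was found
    return None  # give up once the step budget is exhausted
```

The point mirrored here is that the next query is chosen by the model itself, based on everything retrieved so far, rather than being fixed in advance.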
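The compute-scaling observation in the last point can be exercised in a very simple way: run several independent attempts and keep the answer they agree on most. The majority vote below is an assumption for illustration, not the aggregation strategy OpenAI reports.

```python
from collections import Counter
from typing import Callable, List, Optional


def best_of_n(attempt: Callable[[], Optional[str]], n: int = 8) -> Optional[str]:
    """Spend more compute by sampling n independent attempts and majority-voting the answers."""
    answers: List[str] = []
    for _ in range(n):
        answer = attempt()  # e.g. one run of browse_and_answer(...)
        if answer is not None:
            answers.append(answer.strip().lower())
    if not answers:
        return None
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```

A larger n explores more search paths, which is one concrete reading of the claim that extra compute raises the chance of finding the correct answer.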
Model performance on BrowseComp
- GPT-4o and GPT-4.5: Both models perform poorly on BrowseComp, with accuracies of 0.6% and 0.9% respectively. Even with browsing enabled, GPT-4o improves only from 0.6% to 1.9%, indicating that simply giving a model browsing tools is not enough to solve BrowseComp's questions.
- OpenAI o1: Although it cannot browse, its strong reasoning ability yields 9.9% accuracy, underlining how much reasoning matters for these tasks: even without live retrieval, deep reasoning over existing knowledge is enough to answer some questions.
- Deep Research: OpenAI's recently released agent model performs best on BrowseComp, reaching 51.5% accuracy. It uses its browsing tools efficiently, analyzes the retrieved material in depth, and adapts quickly, revising its search strategy in response to what it finds along the way.
The project address of BrowseComp
- Project official website: https://openai.com/index/browsecomp/
- GitHub repository: https://github.com/openai/simple-evals
- Technical Paper: https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf
Application scenarios of BrowseComp
- Enterprise Knowledge Base Retrieval: Browsing agents of the kind BrowseComp evaluates can power intelligent retrieval over enterprise knowledge bases, for example turning a large body of research documents into a Q&A system that speeds up information lookup for R&D staff.
- E-commerce Shopping Guidance: In e-commerce, such agents can drive intelligent shopping-guide systems that help users quickly find products matching complex requirements.
- Government Information Disclosure: Government agencies can use such agents to provide more efficient information-disclosure services, helping the public quickly locate relevant policies, regulations, and other information.
- Research and Development: Researchers can use the benchmark to test and improve the reasoning and search strategies of AI models, advancing AI techniques for information retrieval.