GDPVAL – An Open-Source Framework by OpenAI for Evaluating the Economic Value of AI Models

What is GDPVAL?

GDPVAL is an evaluation framework from OpenAI that measures how AI models perform on tasks with real economic value. It covers 44 occupations drawn from the nine industries that contribute the most to U.S. GDP and comprises 1,320 real-world tasks, 220 of which are included in the open-source version. The tasks span software development, legal writing, mechanical engineering, nursing care plans, and other fields. They are written by professionals with an average of 14 years of experience and go through multiple rounds of review so that they match real workplace scenarios. The goal is to assess the economic value of AI through real work rather than abstract benchmarks, and to help people better understand what AI can do in real-world applications.

Main Functions of GDPVAL

Evaluate the economic value of AI: Measures how AI models perform on economically valuable work tasks, helping to understand their potential in real-world applications.
Cover diverse professions: Includes 44 occupations (in areas such as software development, law, and nursing) across the nine industries with the largest contributions to U.S. GDP, ensuring broad coverage and representativeness.
Reflect real workplace scenarios: Task design is based on authentic work products (e.g., legal briefs, engineering blueprints) and includes reference documents and context. Deliverables include documents, slides, charts, and more.
Expert-reviewed and scored: Tasks are created by professionals with an average of 14 years of experience and undergo multiple review rounds. Scoring is conducted by industry peers to ensure accuracy and reliability.
Advance AI development: Results on real tasks show where models fall short of expert work, giving developers concrete directions for improvement.

Technical Principles of GDPVAL

Task design: Tasks are drawn from the nine industries that contribute most to U.S. GDP (such as finance, healthcare, and manufacturing). From each industry, the five occupations with the largest wage bill contributions are selected, restricted to knowledge work: an occupation qualifies when at least 60% of its tasks involve no physical labor. Tasks are designed by professionals averaging 14 years of experience and reviewed multiple times to ensure representativeness and feasibility.
Evaluation process: Industry specialists blind-review AI-generated outputs against work produced by human experts and rate each AI output as “better,” “comparable,” or “worse” than the human work. OpenAI has also developed an “auto-scorer” (an AI system) that predicts human expert ratings; it is offered as an experimental research tool.
Data collection and analysis: Task data comes from real work scenarios and includes multiple deliverable types (documents, slides, charts, etc.). By comparing outputs of different AI models, the framework analyzes performance across tasks and tracks progress trends.
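The blind comparison described above reduces to a simple tally per model. The following is a minimal sketch of how such ratings could be summarized. It is not OpenAI's scoring code. The function names, metric names, and sample ratings are hypothetical and chosen only for illustration; the sole element taken from the description above is the three-way “better / comparable / worse” rating.

```python
from collections import Counter
from typing import Iterable, Sequence

# The three rating labels come from the article; everything else is assumed.
RATINGS = ("better", "comparable", "worse")


def tally_ratings(ratings: Iterable[str]) -> dict[str, float]:
    """Summarize blind-review ratings of AI outputs versus human expert work."""
    counts = Counter(ratings)
    unknown = set(counts) - set(RATINGS)
    if unknown:
        raise ValueError(f"unexpected rating labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings to tally")
    win = counts["better"] / total
    tie = counts["comparable"] / total
    return {"win_rate": win, "tie_rate": tie, "win_or_tie_rate": win + tie}


def agreement_rate(human: Sequence[str], auto: Sequence[str]) -> float:
    """Fraction of tasks where an automated grader matches the human rating."""
    if len(human) != len(auto) or not human:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(h == a for h, a in zip(human, auto)) / len(human)


# Hypothetical ratings for eight tasks (illustrative data, not real results).
human = ["better", "comparable", "worse", "worse",
         "better", "comparable", "worse", "worse"]
auto = ["better", "worse", "worse", "worse",
        "better", "comparable", "comparable", "worse"]

summary = tally_ratings(human)
agreement = agreement_rate(human, auto)
print(summary)
print(f"auto-scorer agreement: {agreement:.2f}")
```

Reporting the combined “better or comparable” share alongside the plain win rate is a design choice in this sketch: it separates cases where the AI output clearly beats the expert from cases where it merely matches. The agreement rate is one simple way to check how closely an automated grader tracks human reviewers.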

Project Resources

Project website: https://openai.com/index/gdpval/
Hugging Face dataset: https://huggingface.co/datasets/openai/gdpval
Technical paper: https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce/GDPval.pdf

Application Scenarios of GDPVAL

AI model performance evaluation: Used to assess AI models’ capabilities on real economic tasks, helping developers and researchers understand their effectiveness in actual workplace contexts.
Human-AI collaboration in industries: Gives industry experts a framework for judging which professional tasks AI can already handle well, so that work can be divided more effectively between people and AI.
Workforce training and development: Offers insights for vocational training by showing the scope of AI capabilities, helping workers plan career development paths.
Enterprise decision-making support: Helps businesses decide whether adopting AI models would optimize their processes, particularly by reducing costs and improving efficiency.