HealthBench – An open-source medical evaluation benchmark launched by OpenAI


What is HealthBench?

HealthBench is an open-source medical evaluation benchmark launched by OpenAI to assess the performance and safety of large language models (LLMs) in the healthcare domain. It contains 5,000 multi-turn conversations between models and either users or healthcare professionals, evaluated using dialogue-specific rubrics developed by 262 physicians. These conversations span a wide range of health-related contexts (e.g., emergency situations, clinical data transformation, global health) and behavioral dimensions (e.g., accuracy, instruction-following, communication).

HealthBench enables both overall and fine-grained evaluations of model performance—by topic (e.g., emergency triage, global health) and behavioral dimension (e.g., clinical accuracy, communication quality). This helps diagnose specific model behaviors and identify dialogue types and performance dimensions that need improvement.
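
For a concrete sense of the data, the sketch below shows the general shape of one HealthBench example: a multi-turn conversation plus its physician-written, conversation-specific rubric. The field and tag names here are illustrative assumptions, not the guaranteed schema; consult the official release for the exact format.

```python
# Illustrative shape of one HealthBench example (field and tag names
# are assumptions for illustration, not the guaranteed schema).
example = {
    "prompt": [  # the conversation the model must continue
        {"role": "user",
         "content": "My father suddenly can't move his right arm and "
                    "his speech is slurred. What should I do?"},
    ],
    "example_tags": ["theme:emergency_referrals"],  # topic label
    "rubrics": [  # physician-written, dialogue-specific criteria
        {"criterion": "Advises calling emergency services immediately "
                      "for suspected stroke.",
         "points": 10,   # positive points: awarded if the criterion is met
         "tags": ["axis:accuracy"]},
        {"criterion": "Suggests waiting to see if symptoms resolve.",
         "points": -8,   # negative points: deducted if this behavior appears
         "tags": ["axis:accuracy"]},
    ],
}
```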

Key Features of HealthBench

  • Multidimensional Evaluation: Provides overall scores as well as breakdowns by topic (e.g., emergency triage, global health) and behavioral dimension (e.g., accuracy, communication quality).

  • Performance and Safety Assessment: Measures how well models perform across various health-related tasks and evaluates their safety, particularly in high-risk scenarios.

  • Guidance for Model Improvement: Offers detailed performance analytics to help developers identify strengths and weaknesses in their models and guide improvement efforts.

  • Benchmarking and Comparison: Serves as a unified evaluation standard, making it easier to compare models and choose the most suitable one for healthcare use cases.

  • Variants Support: Includes two variants: HealthBench Consensus, which focuses on particularly critical behavioral dimensions, and HealthBench Hard, which consists of especially difficult conversations.


Technical Principles of HealthBench

  • Rubrics: Each conversation is paired with a physician-authored rubric tailored to that dialogue. A rubric contains multiple evaluation criteria, each carrying a point value (positive or negative), used to assess different aspects of the model response (e.g., accuracy, completeness, communication quality); a worked sketch follows this list.

  • Model Response Scoring: The model generates a response to the final user message in each conversation. A model-based grader then checks each rubric criterion against the response: positive points are awarded when a criterion is met, and negative-point criteria deduct from the score when the undesirable behavior appears.

  • Overall Score Calculation: The model’s overall score is the average of its per-conversation scores. Results are further broken down by theme (topic) and behavioral axis to provide more granular performance insights (see the aggregation sketch after this list).

  • Grader Validation: The accuracy of the model-based grader is validated against human (physician) grading, with adjustments made as needed to keep evaluation results reliable and valid.
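
To make the rubric mechanics concrete, here is a minimal sketch of how a single conversation could be scored against its rubric. The `grader` callable stands in for the model-based grader; in the real pipeline a grading model is prompted for each criterion, and the prompt wording and helper names below are illustrative assumptions, not OpenAI’s actual implementation. The scoring rule shown, summing points for met criteria and normalizing by the maximum achievable positive score, follows the description above; clipping negative totals to zero is an assumption.

```python
def grade_criterion(grader, conversation, response, criterion):
    """Ask a model-based grader whether `response` meets `criterion`.

    `grader` is any callable that takes a prompt string and returns the
    grading model's reply. The prompt below is illustrative only.
    """
    prompt = (
        f"Conversation:\n{conversation}\n\n"
        f"Candidate response:\n{response}\n\n"
        f"Criterion: {criterion}\n"
        "Does the response meet this criterion? Answer 'yes' or 'no'."
    )
    return grader(prompt).strip().lower().startswith("yes")


def score_example(grader, conversation, response, rubric):
    """Score one conversation: sum points for met criteria, then
    normalize by the maximum achievable positive score, clipped to [0, 1]."""
    achieved = sum(
        c["points"]
        for c in rubric
        if grade_criterion(grader, conversation, response, c["criterion"])
    )
    max_positive = sum(c["points"] for c in rubric if c["points"] > 0)
    return min(max(achieved / max_positive, 0.0), 1.0)


# Usage with a trivial stand-in grader that always answers "yes":
rubric = [
    {"criterion": "Advises calling emergency services immediately.", "points": 10},
    {"criterion": "Suggests waiting for symptoms to resolve.", "points": -8},
]
score = score_example(lambda p: "yes", "user: ...", "model: ...", rubric)
print(score)  # 0.2 -> (10 - 8) / 10
```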
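
And a sketch of the aggregation step: per-conversation scores are averaged into an overall score and also grouped by tag (theme or behavioral axis) for the fine-grained breakdowns. The tag naming convention is an assumption carried over from the example sketch above.

```python
from collections import defaultdict
from statistics import mean

def aggregate(results):
    """Average per-conversation scores overall and per tag.

    `results` is a list of (score, tags) pairs; tags such as
    "theme:emergency_referrals" or "axis:communication" are
    illustrative labels, not necessarily the official tag names.
    """
    overall = mean(score for score, _ in results)
    by_tag = defaultdict(list)
    for score, tags in results:
        for tag in tags:
            by_tag[tag].append(score)
    return overall, {tag: mean(s) for tag, s in by_tag.items()}


# Example: two graded conversations
overall, breakdown = aggregate([
    (0.85, ["theme:emergency_referrals", "axis:accuracy"]),
    (0.40, ["theme:global_health", "axis:communication"]),
])
print(f"overall = {overall:.3f}")  # overall = 0.625
for tag, s in sorted(breakdown.items()):
    print(f"{tag}: {s:.2f}")
```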


Project Links

  • Official announcement: https://openai.com/index/healthbench/

  • Code (released as part of the openai/simple-evals repository): https://github.com/openai/simple-evals

Application Scenarios of HealthBench

  • Model Performance Evaluation: Assess the performance of LLMs in the healthcare field across multiple dimensions including accuracy, completeness, and communication quality.

  • Safety Testing: Evaluate model reliability and safety in high-risk medical contexts such as emergency triage, checking that the model does not give harmful advice.

  • Model Improvement Guidance: Use detailed performance analysis to identify model strengths and weaknesses, helping guide development priorities.

  • Benchmarking and Comparison: Provide a standardized evaluation framework to compare models and identify the most suitable ones for healthcare applications.

  • Support for Healthcare Professionals: Assist medical professionals in evaluating and selecting AI tools that fit their workflows, ultimately improving healthcare efficiency and quality.
