Stanford Announces Medical AI Evaluation Results

AI Daily News updated 5m ago dongdong

Stanford University recently released MedHELM, a comprehensive evaluation of how large language models perform on clinical tasks. The assessment spans 35 benchmarks across 22 subcategories of medical tasks, designed to reflect the day-to-day work of clinicians. DeepSeek R1 led the field with a 66% win rate and a macro-average score of 0.75, followed closely by o3-mini and Claude 3.7 Sonnet. The study also found that performance varied significantly by task type: models did better on free-text generation tasks, while structured reasoning tasks demanded stronger domain knowledge.

