Stanford Announces Medical AI Evaluation Results

AI Daily News updated 2m ago dongdong

Stanford University recently conducted MedHELM, a comprehensive evaluation of how large language models perform on clinical medical tasks. The assessment spans 35 benchmarks across 22 subcategories of medical tasks and is designed to simulate the daily work of clinicians. DeepSeek R1 led the other models with a 66% win rate and a macro-average score of 0.75, followed closely by o3-mini and Claude 3.7 Sonnet. The study also found that performance varied significantly by task type: models did better on free-text generation tasks, while structured reasoning tasks demanded stronger domain knowledge.
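To make the two headline metrics concrete, here is a minimal Python sketch of how a win rate and a macro-average score might be computed. It assumes that win rate means the share of head-to-head comparisons against other models, per benchmark, in which a model scores higher, and that the macro-average is the unweighted mean of a model's benchmark scores. These definitions are assumptions rather than MedHELM's exact procedure, and the model names, benchmark names, and scores are hypothetical.

```python
from statistics import mean

# Hypothetical per-benchmark scores on a 0-1 scale; NOT MedHELM data.
scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.70, "bench_3": 0.60},
    "model_b": {"bench_1": 0.70, "bench_2": 0.75, "bench_3": 0.50},
    "model_c": {"bench_1": 0.60, "bench_2": 0.65, "bench_3": 0.55},
}


def macro_average(scores, model):
    """Unweighted mean of one model's scores across all benchmarks."""
    return mean(scores[model].values())


def win_rate(scores, model):
    """Fraction of (other model, benchmark) pairs where `model` scores strictly higher.

    Assumed definition for illustration; ties count as non-wins.
    """
    wins = 0
    comparisons = 0
    for other, other_scores in scores.items():
        if other == model:
            continue
        for bench, score in scores[model].items():
            comparisons += 1
            if score > other_scores[bench]:
                wins += 1
    return wins / comparisons


if __name__ == "__main__":
    for name in scores:
        print(name, round(win_rate(scores, name), 2), round(macro_average(scores, name), 2))
```

Under these assumptions the two numbers answer different questions: the win rate reflects how often a model beats its peers, while the macro-average reflects how high it scores, so a model can lead on one without leading on the other.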
