Microsoft Launches SWE-bench-Live Code Evaluation Benchmark
Microsoft's newly released SWE-bench-Live is a benchmark for evaluating how well models resolve real-world code issues. It is designed to address two weaknesses of traditional static benchmarks: outdated data and limited repository coverage. SWE-bench-Live relies on REPOLAUNCH, an agent-based framework that automatically builds the Docker environment each task needs to run and test its code. This lets the benchmark keep adding new tasks drawn from recently filed issues, which reduces the risk of data contamination and of models overfitting to a fixed test set. Initial evaluations show that current large models score markedly lower on this continuously refreshed benchmark than on static ones, underscoring the value of evaluating on fresh, diverse data.
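Fresh tasks matter because a model cannot have memorized the fix for an issue that was filed after its training data was collected. The sketch below illustrates that idea only; it is not SWE-bench-Live's actual code or data schema. The `TaskInstance` record, the `uncontaminated` and `resolved_rate` helpers, and the sample tasks and cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TaskInstance:
    # Hypothetical record for one issue-resolution task; field names are illustrative.
    instance_id: str
    repo: str
    issue_created: date


def uncontaminated(tasks, training_cutoff):
    """Keep only tasks whose issue was opened after the model's training cutoff."""
    return [t for t in tasks if t.issue_created > training_cutoff]


def resolved_rate(resolved_ids, tasks):
    """Fraction of the given tasks whose instance_id appears in resolved_ids."""
    if not tasks:
        return 0.0
    return sum(t.instance_id in resolved_ids for t in tasks) / len(tasks)


# Made-up sample tasks for illustration only.
tasks = [
    TaskInstance("org__lib-101", "org/lib", date(2023, 6, 1)),
    TaskInstance("org__lib-202", "org/lib", date(2024, 3, 15)),
    TaskInstance("acme__tool-7", "acme/tool", date(2025, 1, 20)),
]
fresh = uncontaminated(tasks, training_cutoff=date(2024, 1, 1))
print([t.instance_id for t in fresh])  # ['org__lib-202', 'acme__tool-7']
print(resolved_rate({"org__lib-202"}, fresh))  # 0.5
```

A live benchmark applies the same principle at a larger scale: by continually adding tasks from newly opened issues, it keeps part of the test set newer than any model's training data.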