Multi-SWE-bench – ByteDance’s open-source multi-language code repair benchmark
What is Multi-SWE-bench?
Multi-SWE-bench is the first multi-language code repair benchmark open-sourced by the Doubao model team at ByteDance. Building on SWE-bench, it extends coverage beyond Python to seven mainstream programming languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++, making it a truly “full-stack engineering” evaluation benchmark. The dataset comprises 1,632 real-world repair tasks sourced from GitHub issues, rigorously screened and manually verified so that each sample includes a clear problem description, a correct fix patch, and a reproducible runtime testing environment. It also introduces a task difficulty grading mechanism, categorizing problems into easy, medium, and hard levels and covering development challenges ranging from single-line modifications to multi-file, multi-step tasks with complex semantic dependencies.
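To make the per-sample structure concrete, here is a minimal sketch for inspecting one instance from a locally downloaded JSONL file. The file name and field names (org, repo, problem_statement) are illustrative assumptions rather than the confirmed schema; consult the Hugging Face dataset card for the actual layout.

```python
import json

def load_instances(path):
    """Yield one benchmark instance per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    # Hypothetical file name; the real dataset may be organized per language differently.
    for instance in load_instances("multi_swe_bench_java.jsonl"):
        # Per the description above, each sample carries an issue description,
        # a gold fix patch, and test information; the exact keys are assumptions.
        print(instance.get("org"), instance.get("repo"))
        print(str(instance.get("problem_statement", ""))[:200])
        break
```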

The main functions of Multi-SWE-bench
- Multilingual Code Fix Evaluation: As the industry’s first multilingual code repair benchmark dataset, Multi-SWE-bench covers seven mainstream programming languages in addition to Python: Java, TypeScript, JavaScript, Go, Rust, C, and C++. This enables the dataset to more comprehensively evaluate large models’ automatic code repair capabilities across different programming language environments.
- Task Difficulty Grading: The dataset introduces a task difficulty grading mechanism, categorizing problems into three levels: Easy, Medium, and Hard. This grading covers development challenges ranging from single-line modifications to multi-file, multi-step tasks with complex semantic dependencies, allowing a more systematic evaluation of large models’ performance across capability levels (see the bucketing sketch after this list).
- Real Data Support: All 1,632 instances in Multi-SWE-bench are sourced from real open-source repositories (GitHub issues), curated under unified testing standards, and reviewed and screened by professional developers. Each sample comes with a clear problem description, a correct fix patch, and a reproducible runtime testing environment, ensuring the quality and practicality of the dataset.
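The following sketch shows how the multilingual coverage and difficulty grading can be used together: it buckets instances by (language, difficulty) so counts or results can be reported per bucket. The keys `language` and `difficulty` are assumed field names, not confirmed schema.

```python
from collections import Counter

def bucket_counts(instances):
    """Count instances per (language, difficulty) bucket."""
    counts = Counter()
    for inst in instances:
        counts[(inst.get("language", "unknown"), inst.get("difficulty", "unknown"))] += 1
    return counts

# Usage, reusing load_instances from the earlier sketch:
# for (lang, level), n in sorted(bucket_counts(load_instances("multi_swe_bench_java.jsonl")).items()):
#     print(f"{lang:12s} {level:8s} {n:4d}")
```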
The technical principles of Multi-SWE-bench
- Data Source and Quality Control: The dataset’s 1,632 instances are all drawn from real open-source repositories (GitHub issues) and pass through unified testing standards and review by professional developers. During construction, the team followed a rigorous five-stage data construction workflow:
◦ Open-source Repository Selection: Select high-quality project repositories from GitHub’s public repositories along multiple quality dimensions.
◦ Pull Request Crawling: Collect pull requests (PRs) related to issues and extract key information.
◦ Docker Environment Construction: Build a corresponding Docker environment for each PR so that every task in the dataset is fully runnable.
◦ PR Filtering and Validation: Identify valid fixes through a three-state testing process (original state, test patch only, and test plus fix patches); see the validation sketch after this list.
◦ Manual Verification: Introduce a double-annotation process by human experts to ensure the reliability and accuracy of the data.
- Reinforcement Learning Support: To facilitate the application of reinforcement learning (RL) to code repair tasks, the team has also open-sourced Multi-SWE-RL, an open-source community that provides 4,723 structured training samples, each equipped with a reproducible Docker environment. It supports one-click startup, automatic evaluation, and quick integration with RL training frameworks. This “evaluation + training” dual-drive model offers robust support for the continuous optimization of large models.
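The three-state check in the PR filtering step boils down to: the newly added tests should fail when only the test patch is applied and pass once the fix patch is also applied. Below is a minimal sketch of that logic driven through a per-instance Docker container, with a test-based reward at the end to hint at how such environments could feed an RL loop. The image name, patch file names, mounted paths, and run_tests.sh entry point are illustrative assumptions, not the benchmark’s actual tooling or Multi-SWE-RL’s actual interface.

```python
import os
import subprocess

# Host directory holding test.patch / fix.patch for the current instance (assumed layout).
PATCH_DIR = os.path.abspath("patches")

def run_tests(image, patches):
    """Start a fresh container from `image`, apply the given patches, and run the test suite.
    Returns True if the suite passes (exit code 0)."""
    apply_cmds = " && ".join(f"git apply /patches/{p}" for p in patches)
    script = (apply_cmds + " && " if apply_cmds else "") + "bash run_tests.sh"
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{PATCH_DIR}:/patches:ro",
         image, "bash", "-lc", script],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def is_valid_fix(image):
    """Three-state validation: original repo, test patch only, test + fix patches."""
    baseline_ok = run_tests(image, [])                               # 1) original state
    fails_without_fix = not run_tests(image, ["test.patch"])         # 2) new tests should expose the bug
    passes_with_fix = run_tests(image, ["test.patch", "fix.patch"])  # 3) gold fix should resolve it
    # The real pipeline compares per-test outcomes across the three runs;
    # this sketch collapses that to suite-level pass/fail.
    return baseline_ok and fails_without_fix and passes_with_fix

def reward(image, model_patch_name):
    """Illustrative test-based reward for a model-proposed patch: 1.0 if the suite passes."""
    return 1.0 if run_tests(image, ["test.patch", model_patch_name]) else 0.0
```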
The project address of Multi-SWE-bench
- Project Website: https://multi-swe-bench.github.io/#/
- GitHub Repository: https://github.com/multi-swe-bench/multi-swe-bench
- Hugging Face Dataset: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench
- arXiv Technical Paper: https://arxiv.org/pdf/2504.02605
Application scenarios of Multi-SWE-bench
- Code Fix Automation: Developers can use models trained with Multi-SWE-bench to automatically identify and fix bugs in code, reducing the time and effort required for manual debugging.
- Model Performance Evaluation and Improvement: The dataset provides a systematic evaluation benchmark for large models, helping developers and researchers assess model performance across different programming languages and task difficulties (a minimal evaluation-loop sketch follows this list).
- Programming Language Comparison Research: By comparing models’ bug-fixing performance across programming languages, researchers can gain deeper insight into where models are strong or weak in each language ecosystem.
- Learning and Education: For developers and learners, Multi-SWE-bench serves as a platform for learning and skill enhancement. By studying the dataset’s real issues and fix patches, developers can better understand common errors and repair patterns in different programming languages, improving their programming and problem-solving skills.
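As a sketch of the evaluation scenario above, the loop below asks the model under test for a candidate patch per instance, applies it together with the instance’s test patch in the instance’s Docker environment, and reports the fraction of instances whose tests pass. It reuses the run_tests helper from the earlier sketch; generate_patch, the record keys (problem_statement, image), and the model.patch convention are assumptions, and this is not the benchmark’s official evaluation harness.

```python
import os

def generate_patch(problem_statement):
    """Placeholder for the model or agent under evaluation; should return a unified diff."""
    raise NotImplementedError

def evaluate(instances, patch_dir="patches"):
    """Return the fraction of instances whose test suite passes with the model's patch applied."""
    resolved, total = 0, 0
    for inst in instances:
        total += 1
        candidate = generate_patch(inst["problem_statement"])  # assumed key
        with open(os.path.join(patch_dir, "model.patch"), "w", encoding="utf-8") as f:
            f.write(candidate)
        # Apply the instance's test patch plus the candidate fix, then rerun the suite
        # inside the instance's reproducible Docker environment.
        if run_tests(inst["image"], ["test.patch", "model.patch"]):  # assumed key
            resolved += 1
    return resolved / max(total, 1)
```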