AutoCodeBench – Tencent Hunyuan’s Open-Source Benchmark Dataset for Evaluating Code Generation Capabilities of Large Models

What is AutoCodeBench?

AutoCodeBench is a benchmark dataset released by Tencent Hunyuan for evaluating the coding capabilities of large language models. It contains 3,920 problems evenly distributed across 20 programming languages. The problems are designed to be difficult, practical, and diverse, so the benchmark can comprehensively assess model performance on multilingual programming tasks. The dataset is generated through an automated workflow intended to ensure quality and coverage. Two variants are also provided: a simplified version (AutoCodeBench-Lite) and a version tailored for base model evaluation (AutoCodeBench-Complete).

Main Features of AutoCodeBench

Multilingual Code Evaluation: Offers 3,920 problems covering 20 programming languages, providing a comprehensive assessment of large models’ multilingual code generation abilities.
High-Difficulty Benchmarking: Includes challenging problems that effectively identify weaknesses of large models in complex programming tasks.
Amplifying Performance Differences: AutoCodeBench-Lite is constructed from filtered problems to highlight performance gaps among different models, making comparisons more effective.
Base Model Evaluation: AutoCodeBench-Complete uses 3-shot prompts and is designed specifically for evaluating the coding performance of base models.
Automated Code Data Generation: LLMs generate test inputs, a sandbox executes them to obtain the corresponding outputs, and the resulting input-output pairs are synthesized into high-quality multilingual code generation data.
Multilingual Code Execution Verification: Provides the MultiLanguageSandbox service, supporting compilation and execution in over 30 programming languages, to verify correctness of generated code.
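
As an illustration of the 3-shot format mentioned for AutoCodeBench-Complete, the following is a minimal sketch of how a few-shot prompt for a base model might be assembled. The `Shot` class, the `build_few_shot_prompt` function, the section headers, and the example problems are all hypothetical; the actual prompt template used by the benchmark is not described here and may differ.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """One worked example shown to the base model (hypothetical structure)."""
    problem: str
    solution: str


def build_few_shot_prompt(shots: list[Shot], target_problem: str, language: str) -> str:
    """Concatenate solved examples, then leave the final solution open."""
    parts = []
    for i, shot in enumerate(shots, start=1):
        parts.append(
            f"### Example {i} ({language})\n"
            f"Problem:\n{shot.problem}\n"
            f"Solution:\n{shot.solution}\n"
        )
    # The target problem ends with an open "Solution:" header so that a base
    # model continues the text with code rather than following an instruction.
    parts.append(f"### Task ({language})\nProblem:\n{target_problem}\nSolution:\n")
    return "\n".join(parts)


# Three illustrative shots (made up for this sketch, not taken from the dataset).
shots = [
    Shot("Return the sum of two integers.", "def add(a, b):\n    return a + b"),
    Shot("Return the larger of two integers.", "def larger(a, b):\n    return a if a > b else b"),
    Shot("Return True if n is even.", "def is_even(n):\n    return n % 2 == 0"),
]
prompt = build_few_shot_prompt(shots, "Return the absolute value of an integer.", "python")
print(prompt)
```

Ending the prompt on an open solution header is a common choice for base models, which complete text rather than respond to instructions.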

Technical Principles of AutoCodeBench

Automated Data Generation: AutoCodeGen uses LLMs to generate test inputs and passes them to a sandbox for execution. The sandbox returns the test outputs, which are then used to build high-quality test functions. Problems are constructed in reverse order, meaning the problem statement is written after its solution and tests already exist, which helps ensure difficulty and diversity. Multiple filtering strategies further safeguard quality, difficulty, and practicality.
Multilingual Support: The 3,920 problems are evenly distributed across 20 programming languages to ensure balanced evaluation and avoid skewed distributions. The MultiLanguageSandbox supports compilation and execution in over 30 programming languages, verifying correctness and performance of generated code across diverse language environments.
High Difficulty & Practicality: Through reverse construction and filtering, the generated problems are both challenging and relevant to real-world use. They reflect complex scenarios encountered in actual programming work, so benchmark results are more indicative of how models perform in practice.
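
The input-generation, sandbox-execution, and test-building steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the AutoCodeGen implementation: the test inputs are hard-coded instead of coming from an LLM, a separate Python process stands in for the MultiLanguageSandbox service, and the names `run_in_sandbox`, `build_test_function`, and `dedupe_sorted` are hypothetical.

```python
import json
import subprocess
import sys

# Reference solution for one problem (a hand-written stand-in for this sketch).
SOLUTION = '''
def dedupe_sorted(values):
    return sorted(set(values))
'''

# Test inputs. The article says these come from an LLM; here they are hard-coded.
TEST_INPUTS = [[3, 1, 3, 2], [], [5, 5, 5]]


def run_in_sandbox(solution: str, func_name: str, args: list) -> object:
    """Stand-in for a sandbox call: run the solution in a separate process.

    A real sandbox would isolate execution and support many languages; this
    only separates the run from the caller and enforces a timeout.
    """
    driver = (
        solution
        + "\nimport json, sys\n"
        + f"print(json.dumps({func_name}(*json.loads(sys.argv[1]))))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", driver, json.dumps(args)],
        capture_output=True, text=True, timeout=10, check=True,
    )
    return json.loads(result.stdout)


def build_test_function(func_name: str, cases: list) -> str:
    """Turn recorded (arguments, output) pairs into test-function source."""
    lines = [f"def test_{func_name}():"]
    for args, expected in cases:
        arg_src = ", ".join(repr(a) for a in args)
        lines.append(f"    assert {func_name}({arg_src}) == {expected!r}")
    return "\n".join(lines)


# Execute every input, record the observed output, then emit the assertions.
cases = []
for values in TEST_INPUTS:
    args = [values]  # one positional argument per call
    cases.append((args, run_in_sandbox(SOLUTION, "dedupe_sorted", args)))

test_source = build_test_function("dedupe_sorted", cases)
print(test_source)
```

Because the expected outputs are recorded from execution rather than predicted by a model, the resulting assertions are consistent with the reference solution by construction.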

Project Links

Project Website: https://autocodebench.github.io/
GitHub Repository: https://github.com/Tencent-Hunyuan/AutoCodeBenchmark
HuggingFace Dataset: https://huggingface.co/datasets/tencent/AutoCodeBenchmark
arXiv Paper: https://arxiv.org/pdf/2508.09101

Application Scenarios of AutoCodeBench

Model Performance Evaluation: Used to comprehensively assess large models’ code generation ability across multilingual programming tasks, helping identify strengths and weaknesses.
Dataset Construction & Optimization: The automated workflow can generate high-quality, high-difficulty code generation datasets and supports building custom datasets to improve training effectiveness.
Multilingual Capability Verification: Validates large models’ performance in different programming languages, including low-resource ones, advancing research in multilingual programming.
Model Training & Validation: Serves as supplemental training data to improve performance on complex coding tasks and to periodically validate training results.
Academic & Industrial Applications: Provides standardized benchmarks for academic research, while supporting the development and optimization of code generation tools in industrial scenarios.
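
For the model evaluation scenario, per-language results can be summarized with a small aggregation such as the one below. This is a minimal sketch: the function names are hypothetical, the pass/fail records are invented purely for illustration and are not benchmark results, and the benchmark's own scoring scripts may compute metrics differently.

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Compute the fraction of passed problems for each language.

    `results` is an iterable of (language, passed) pairs, one per problem.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for language, ok in results:
        totals[language] += 1
        passed[language] += int(ok)
    return {lang: passed[lang] / totals[lang] for lang in totals}


def macro_average(rates):
    """Average the per-language rates so each language counts equally."""
    return sum(rates.values()) / len(rates)


# Invented outcomes for illustration only.
results = [
    ("python", True), ("python", False),
    ("rust", True), ("rust", True),
    ("go", False), ("go", False),
]
rates = pass_rate_by_language(results)
print(rates)
print(macro_average(rates))
```

Reporting the per-language breakdown alongside the average makes it easier to spot languages where a model is comparatively weak.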