Fugu-MT Paper Translation (Abstract): AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Paper overview: AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

arxiv url: http://arxiv.org/abs/2508.09101v1
Date: Tue, 12 Aug 2025 17:29:20 GMT
Status: Translation complete
System update date: 2025-08-13 21:07:34.52778
Title: AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Title (reference translation): AutoCodeBench: Large Language Models Are Automatic Code Benchmark Generators
Authors: Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian,
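Abstract summary: We introduce AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version, AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks.
Reference score (independently computed attention score): 11.285930594120076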
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. Besides, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
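Abstract (reference translation): Large language models (LLMs) have shown remarkable capabilities across a wide range of domains, and code generation has emerged as a key area. Many benchmarks have been proposed to evaluate code generation ability, but they have several limitations. First, they often depend on manual annotation, which is time-consuming and hard to scale across programming languages and problem complexities. Second, most existing benchmarks concentrate on Python, while the few multilingual benchmarks suffer from limited difficulty and an uneven language distribution. To address these issues, we propose AutoCodeGen, which automatically generates multilingual code generation datasets without manual annotation. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, and it achieves high data quality through reverse-order problem generation and multiple filtering steps. Using this method, we introduce AutoCodeBench, a large-scale code generation benchmark of 3,920 problems evenly distributed across 20 programming languages. It is designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version, AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. In addition, we introduce AutoCodeBench-Complete, designed specifically for base models, to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and encourage the community to focus on more challenging and practical multilingual code generation scenarios.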
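
Illustrative sketch (not from the paper): the abstract says that test inputs are generated by LLMs and test outputs are obtained by executing code in a multilingual sandbox. A minimal Python sketch of that idea follows, under loud assumptions: a plain child process stands in for the sandbox, the test inputs are hard-coded where an LLM would propose them, and every name here (run_in_sandbox, build_test_cases, REFERENCE_SOLUTION, TEST_INPUTS) is hypothetical and not part of AutoCodeGen's actual implementation.

```python
import subprocess
import sys


def run_in_sandbox(source: str, stdin_text: str, timeout: float = 5.0) -> str:
    """Stand-in for a sandbox: run Python source in a child process.

    A real sandbox would add isolation and support many languages;
    this only separates execution from the calling process.
    """
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the reference solution crashes
    )
    return result.stdout


def build_test_cases(reference_solution: str, test_inputs: list[str]) -> list[dict]:
    """Pair each proposed input with the output the solution actually produces.

    Expected outputs come from execution, never from a model's guess.
    """
    cases = []
    for stdin_text in test_inputs:
        expected = run_in_sandbox(reference_solution, stdin_text)
        cases.append({"input": stdin_text, "expected_output": expected})
    return cases


# Hypothetical reference solution: sum the integers on one input line.
REFERENCE_SOLUTION = "print(sum(int(tok) for tok in input().split()))"

# Assumption: in the described pipeline an LLM proposes these inputs.
TEST_INPUTS = ["1 2 3\n", "10 -4\n", "7\n"]

if __name__ == "__main__":
    for case in build_test_cases(REFERENCE_SOLUTION, TEST_INPUTS):
        print(case)
```

Deriving expected outputs by execution means a test case can only be wrong if the reference solution itself is wrong, which is presumably why the abstract pairs this step with multiple filtering steps.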

Related papers