Fugu-MT Paper Translation (Abstract): CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation

Paper overview: CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation

arxiv url: http://arxiv.org/abs/2505.19502v1
Date: Mon, 26 May 2025 04:29:14 GMT
Status: Translation complete
System update date: 2025-05-27 16:58:43.17019
Title: CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
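Title (reference translation): CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Authors: Guang Yang, Yu Zhou, Xiang Chen, Wei Zheng, Xing Hu, Xin Zhou, David Lo, Taolue Chen
Abstract summary: This paper proposes CODE-DITING, a code evaluation method that balances accuracy, efficiency, and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to the CODE-DITING 1.5B and 7B models. With a majority-vote strategy at inference time, CODE-DITING 1.5B outperforms all models of the same parameter magnitude.
Reference score (independently computed attention score): 22.06897150825726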
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while methods based on reasoning foundation models provide better explainability with simpler prompts but demand substantial computational resources due to their large parameter sizes. To address these limitations, we propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency, and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to our CODE-DITING 1.5B and 7B models, significantly enhancing evaluation explainability and reducing computational cost. With a majority-vote strategy at inference time, CODE-DITING 1.5B outperforms all models of the same parameter magnitude and achieves performance normally seen in models with five times its parameter scale. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B, even though it uses only 1% of the parameter volume of these large models. Further experiments show that CODE-DITING is robust to preference leakage and can serve as a promising alternative for code evaluation.
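Abstract (reference translation): Reliable methods for evaluating code snippets play an important role in neural code generation. Conventional methods either depend on reference solutions or require executable test cases, and are inherently limited in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly assessing the functional consistency between a problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we carried out a comprehensive empirical study on three diverse datasets. The study clarifies the strengths and weaknesses of two categories of LLM-as-Judge methods: methods built on general foundation models can reach good performance but need complex prompts and lack explainability, while methods built on reasoning foundation models are more explainable with simpler prompts but require substantial computational resources because of their large parameter sizes. To address these limitations, we propose CODE-DITING, a new code evaluation method that balances accuracy, efficiency, and explainability. We developed a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to the CODE-DITING 1.5B and 7B models. With a majority-vote strategy at inference time, CODE-DITING 1.5B outperforms all models of the same parameter magnitude and reaches performance normally seen in models with five times its parameter scale. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B while using only 1% of their parameter volume. Further experiments show that CODE-DITING is robust to preference leakage and can serve as a promising alternative for code evaluation.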
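
The abstract mentions a majority-vote strategy at inference time but gives no details on this page. The following is a minimal sketch of one plausible reading, assuming the judge model is sampled several times for the same (problem, code) pair and the most frequent verdict is kept. The function name `majority_vote`, the boolean verdicts, the sample count, and the tie-breaking rule are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most frequent verdict among sampled judge outputs.

    Assumption: ties are broken in favor of the verdict seen first.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    best = max(counts.values())
    # Walk the verdicts in sampling order so ties resolve deterministically.
    for verdict in verdicts:
        if counts[verdict] == best:
            return verdict


# Hypothetical data: five sampled verdicts for one (problem, code) pair,
# where True means "the code is functionally consistent with the problem".
samples = [True, False, True, True, False]
print(majority_vote(samples))  # True
```

An odd number of samples avoids ties for binary verdicts; the first-seen tie-break above only matters when the sample count is even or the verdict space has more than two labels.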

Related papers