Fugu-MT Paper Translation (Abstract): Analyzing and Mitigating Surface Bias in Code Evaluation Metrics

Paper overview: Analyzing and Mitigating Surface Bias in Code Evaluation Metrics

arxiv url: http://arxiv.org/abs/2509.15397v2
Date: Tue, 07 Oct 2025 22:47:18 GMT
Status: Translation complete
Last updated in system: 2025-10-09 14:21:18.141893
Title: Analyzing and Mitigating Surface Bias in Code Evaluation Metrics
Title (reference translation): Analysis and Mitigation of Surface Bias in Code Evaluation Metrics
Authors: Simantika Bhattacharjee Dristi, Matthew B. Dwyer
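Abstract summary: We critically evaluate four reference-based code evaluation metrics (CEMs). We propose LoCaL, a CEM evaluation benchmark. We find that all four CEMs show significant performance degradation on LoCaL compared to the baselines.
Reference score (independently computed attention level): 15.211628096103473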
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the increasing popularity of large language models (LLMs) and LLM-based agents, reliable and effective code evaluation metrics (CEMs) have become crucial for progress across several software engineering tasks. While popular benchmarks often provide test cases to assess the correctness of generated code, crafting and executing test cases is expensive. Reference-based CEMs provide a cheaper alternative by scoring a candidate program based on its functional similarity to a reference. Although prior research has focused on reporting the weak correlation between these CEMs and functional correctness, the causes are only assumed, and plausible solutions remain unexplored. In this work, we critically evaluate four state-of-the-art reference-based CEMs, revealing their strong bias towards surface-level features rather than code functionality. Despite this surface bias, current evaluation datasets for these CEMs rarely include code pairs that are surface-similar yet functionally dissimilar, or functionally similar yet surface-dissimilar. To mitigate this gap, we propose LoCaL (Looks Can Lie), a CEM evaluation benchmark, with 3117 code pairs at both the method and program levels. Each pair is labeled with a functional similarity score and aims to target regions where CEMs are likely to perform poorly. The functional similarity scores are calculated through differential fuzzing, which eliminates the need for predefined test cases and, at the same time, improves the reliability of the scores by executing an order of magnitude more tests than prior work. We find that all four CEMs show significant performance degradation on LoCaL, compared to the baselines. Finally, based on our findings, we draw the implication that exposing CEMs to LoCaL-like data might facilitate the development of metrics that are robust to surface bias.
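Abstract (reference translation): As large language models (LLMs) and LLM-based agents grow in popularity, reliable and effective code evaluation metrics (CEMs) have become essential to progress on several software engineering tasks. Popular benchmarks often supply test cases for assessing the correctness of generated code, but writing and running test cases is expensive. Reference-based CEMs offer a cheaper alternative by scoring a candidate program on its functional similarity to a reference. Prior research has concentrated on reporting the weak correlation between these CEMs and functional correctness, yet the causes are only assumed and plausible solutions remain unexplored. This work critically evaluates four state-of-the-art reference-based CEMs and reveals a strong bias toward surface-level features over code functionality. Despite this surface bias, current evaluation datasets for these CEMs rarely include code pairs that are surface-similar yet functionally dissimilar, or functionally similar yet surface-dissimilar. To close this gap, the authors propose LoCaL (Looks Can Lie), a CEM evaluation benchmark with 3117 code pairs at both the method and program levels. Each pair is labeled with a functional similarity score and targets regions where CEMs are likely to perform poorly. The functional similarity scores are computed through differential fuzzing, which removes the need for predefined test cases and improves the reliability of the scores by executing an order of magnitude more tests than prior work. All four CEMs show significant performance degradation on LoCaL compared to the baselines. Finally, the findings suggest that exposing CEMs to LoCaL-like data might facilitate the development of metrics that are robust to surface bias.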

Related papers