Fugu-MT Paper Translation (Abstract): LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks

Paper overview: LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks

arxiv url: http://arxiv.org/abs/2502.06215v1
Date: Mon, 10 Feb 2025 07:33:49 GMT
Status: Translation complete
Last updated in system: 2025-02-11 18:57:50.939389
Title: LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks
Title (reference translation): LessLeak-Bench: A first investigation of data leakage in LLMs across 83 software engineering benchmarks
Authors: Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, David Lo
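Abstract summary: Large language models (LLMs) are widely used in software engineering (SE) tasks such as code generation and automated program repair. Their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage. This paper presents a large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs.
Reference score (independently computed attention score): 15.584759853972992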
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, where the evaluation benchmark data is unintentionally "seen" by LLMs during the model's construction phase. The data leakage issue could largely undermine the validity of LLM-based research and evaluations. Despite the increasing use of LLMs in the SE community, there is no comprehensive study that assesses the extent of data leakage in SE benchmarks for LLMs yet. To address this gap, this paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our results show that in general, data leakage in SE benchmarks is minimal, with average leakage ratios of only 4.8%, 2.8%, and 0.7% for Python, Java, and C/C++ benchmarks, respectively. However, some benchmarks exhibit relatively higher leakage ratios, which raises concerns about their bias in evaluation. For instance, QuixBugs and BigCloneBench have leakage ratios of 100.0% and 55.7%, respectively. Furthermore, we observe that data leakage has a substantial impact on LLM evaluation. We also identify key causes of high data leakage, such as the direct inclusion of benchmark data in pre-training datasets and the use of coding platforms like LeetCode for benchmark construction. To address the data leakage, we introduce LessLeak-Bench, a new benchmark that removes leaked samples from the 83 SE benchmarks, enabling more reliable LLM evaluations in future research. Our study enhances the understanding of data leakage in SE benchmarks and provides valuable insights for future research involving LLMs in SE.
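Abstract (reference translation): Large language models (LLMs) are widely used in software engineering (SE) tasks such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, in which evaluation benchmark data is unintentionally "seen" by LLMs during the model's construction phase. The data leakage problem can significantly undermine the validity of LLM-based research and evaluation. Despite the growing use of LLMs in the SE community, no comprehensive study has yet assessed the extent of data leakage in SE benchmarks for LLMs. To address this gap, we conduct a large-scale analysis of data leakage across 83 SE benchmarks concerning LLMs. The results show that the average leakage ratios in SE benchmarks are 4.8%, 2.8%, and 0.7% for Python, Java, and C/C++ benchmarks, respectively. However, some benchmarks show relatively high leakage ratios, raising concerns about bias in evaluation. For example, QuixBugs and BigCloneBench have leakage ratios of 100.0% and 55.7%, respectively. We also confirm that data leakage has a substantial impact on LLM evaluation. In addition, we identify causes of high data leakage, such as the direct inclusion of benchmark data in pre-training datasets and the use of coding platforms like LeetCode for benchmark construction. To address data leakage, we introduce LessLeak-Bench, a new benchmark that removes leaked samples from the 83 SE benchmarks. This study deepens the understanding of data leakage in SE benchmarks and provides useful insights for future research on LLMs in SE.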


Related papers