Fugu-MT Paper Translation (Abstract): A New Benchmark for Evaluating Code Translation with Third-Party Libraries

Paper overview: A New Benchmark for Evaluating Code Translation with Third-Party Libraries

arxiv url: http://arxiv.org/abs/2509.12087v1
Date: Mon, 15 Sep 2025 16:16:14 GMT
Status: Translation complete
System update date: 2025-09-16 17:26:23.390855
Title: A New Benchmark for Evaluating Code Translation with Third-Party Libraries
Authors: Pengyu Xue, Kunwu Zheng, Zhen Yang, Yifei Pei, Linhao Wu, Jiahui Dong, Xiapu Luo, Yan Xiao, Fei Liu, Yuxuan Zhang, Xiran Lyu, Xianhang Li, Xuanyu Zhu, Chengyi Wang
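Abstract summary: TransLibEval is the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development. Seven recent LLMs from commercial, general, and code-specialized families are evaluated under six translation strategies in three categories: Direct, IR-guided, and Retrieval-augmented.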
Reference score (independently calculated attention score): 37.53966825335189
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: In recent years, Large Language Models (LLMs) have been widely studied in the code translation field at the method, class, and even repository levels. However, most existing benchmarks are limited in terms of Third-Party Library (TPL) categories and scale, making TPL-related errors hard to expose and hindering the development of targeted solutions. Considering the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs becomes imperative. To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs from commercial, general, and code-specialized families under six translation strategies in three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (average CA decline of over 60%), while the different strategies show heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were previously obscured. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.

Related papers