Fugu-MT Paper Translation (Abstract): TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation

Paper overview: TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation

arxiv url: http://arxiv.org/abs/2508.11468v1
Date: Fri, 15 Aug 2025 13:33:52 GMT
Status: Translation complete
System update date: 2025-08-18 14:51:23.965256
Title: TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation
Authors: Zhihao Gong, Zeyu Sun, Dong Huang, Qingyuan Liang, Jie M. Zhang, Dan Hao
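Abstract summary: We introduce TRACY, the first comprehensive benchmark designed to evaluate the execution efficiency of LLM-translated code. The resulting benchmark comprises 1,011 code translation tasks across C++, Java, and Python. Our work underscores the necessity of jointly optimizing for correctness and efficiency in future LLM-based code translation.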
Reference score (independently computed attention score): 15.302454413096335
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic code translation is a fundamental task in modern software development. While the advent of Large Language Models (LLMs) has significantly improved the correctness of code translation, the critical dimension of execution efficiency remains overlooked. To address this gap, we introduce TRACY, the first comprehensive benchmark designed to evaluate the execution efficiency of LLM-translated code. TRACY is constructed through an LLM-driven two-stage pipeline: an initial stage generates a suite of stress tests to amplify performance differences, followed by an efficiency-oriented task pruning stage that isolates the efficiency-distinguishing tasks. The resulting benchmark comprises 1,011 code translation tasks across C++, Java, and Python, each accompanied by an average of 22.1 verified reference translations and 10 computationally demanding tests. Our extensive evaluation of 26 representative LLMs reveals that even top-tier LLMs struggle to consistently produce efficient code translations. For instance, Claude-4-think, the leading model for correctness, ranks eighth overall when time efficiency is taken into account, surpassed by several smaller open-source models. We further pinpoint that algorithmic flaws and improper resource handling are the most detrimental, causing a median time slowdown of 5.6$\times$ and memory increase of 12.0$\times$, respectively. Our work underscores the necessity of jointly optimizing for correctness and efficiency in future LLM-based code translation.
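The abstract reports efficiency as a slowdown factor relative to reference translations and mentions a pruning stage that keeps only efficiency-distinguishing tasks. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation: the function names, the per-test ratio definition, and the `min_spread` threshold are all assumptions made for illustration, and the timings are made up.

```python
from statistics import median


def median_slowdown(candidate_times, reference_times):
    """Median of per-test ratios candidate / reference.

    Values above 1 mean the candidate translation is slower.
    This ratio definition is an assumption, not taken from the paper.
    """
    if len(candidate_times) != len(reference_times):
        raise ValueError("need one timing per test for both translations")
    return median(c / r for c, r in zip(candidate_times, reference_times))


def is_efficiency_distinguishing(reference_runtimes, min_spread=2.0):
    """Hypothetical pruning rule: keep a task only if its slowest reference
    translation is at least `min_spread` times slower than its fastest one."""
    return max(reference_runtimes) / min(reference_runtimes) >= min_spread


# Made-up timings in seconds for three stress tests.
candidate = [0.9, 2.0, 4.2]
reference = [0.3, 0.5, 1.2]
print(f"median slowdown: {median_slowdown(candidate, reference):.2f}x")

# A task whose reference translations all run at similar speed would be pruned.
print(is_efficiency_distinguishing([0.10, 0.12, 0.11]))
```

Using the median rather than the mean keeps a single pathological test from dominating the reported factor, which matches the abstract's choice of reporting median slowdowns.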


Related papers