Fugu-MT Paper Translation (Abstract): TIT: A Tree-Structured Instruction Tuning Approach for LLM-Based Code Translation

Paper overview: TIT: A Tree-Structured Instruction Tuning Approach for LLM-Based Code Translation

arxiv url: http://arxiv.org/abs/2510.09400v1
Date: Fri, 10 Oct 2025 13:53:46 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:49.196092
Title: TIT: A Tree-Structured Instruction Tuning Approach for LLM-Based Code Translation
Authors: He Jiang, Yufu Wang, Hao Lin, Peiyu Zou, Zhide Zhou, Ang Jia, Xiaochen Li, Zhilei Ren
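Abstract summary: We propose TIT, a tree-structured instruction tuning paradigm for LLM-based code translation. To mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features. To generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns nodes with code segments.
Reference score (independently computed attention metric): 11.882496324328905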
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have shown strong performance in automated source-to-target code translation through pretraining on extensive code corpora. However, mainstream LLM-based code translation methods suffer from two critical limitations. First, they are highly sensitive to language-specific features, which often introduce source-language syntax or lexicon into the output, leading to syntactic confusion. Second, they lack fine-grained semantic alignment due to an over-reliance on function-level parallel datasets, resulting in semantic misalignment between the translated code and the original source. To overcome these limitations, we propose TIT, a Tree-structured Instruction Tuning paradigm for LLM-based code translation. Specifically, TIT consists of three modules. First, to mitigate syntactic confusion, the syntactic information representation module integrates language-agnostic syntactic features via structured parsing. Then, to generate high-quality fine-grained parallel data, the fine-grained parallel dataset augmentation module aligns nodes with code segments through statement-level segmentation and contrastive matching. Finally, we leverage the dual-stage tree instruction tuning module to alleviate the contextual processing burden on the LLM caused by the introduction of syntactic information. The first stage employs syntax-aware fine-tuning to enable the LLM to autonomously comprehend structured syntactic information, while the second stage utilizes code generation fine-tuning to guide the model in generating accurate target code based on function-level syntactic dependencies. The experimental results demonstrate that the proposed method significantly outperforms existing approaches in multiple LLMs, achieving a success rate 1.22x-1.75x higher in code translation while markedly reducing syntactic confusion.
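The abstract does not say which parser or alignment procedure TIT uses. As a rough illustration of what pairing syntax-tree nodes with statement-level code segments might look like, here is a minimal sketch built on Python's standard `ast` module. The function name `statement_segments` and the sample input are hypothetical, and the contrastive matching step mentioned in the abstract is not shown.

```python
import ast

# Hypothetical sample input; not taken from the paper.
SAMPLE = "def add(a, b):\n    total = a + b\n    return total\n"


def statement_segments(source: str) -> list[tuple[str, str]]:
    """Pair each statement node's type name with the source text it covers.

    This is an illustrative assumption about node-to-segment alignment,
    not the paper's actual procedure.
    """
    tree = ast.parse(source)
    # ast.walk makes no ordering guarantee, so sort by source position.
    statements = sorted(
        (node for node in ast.walk(tree) if isinstance(node, ast.stmt)),
        key=lambda node: (node.lineno, node.col_offset),
    )
    pairs = []
    for node in statements:
        segment = ast.get_source_segment(source, node)
        pairs.append((type(node).__name__, segment))
    return pairs


if __name__ == "__main__":
    for node_type, segment in statement_segments(SAMPLE):
        print(f"{node_type}: {segment!r}")
```

The node type names (`FunctionDef`, `Assign`, `Return`) stand in for the kind of syntactic feature that could be paired with a target-language segment. A multi-language setup would need a parser that covers both source and target languages, which this sketch does not attempt.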

Related papers