Fugu-MT Paper Translation (Abstract): Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs

Paper overview: Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs

arxiv url: http://arxiv.org/abs/2602.20799v1
Date: Tue, 24 Feb 2026 11:36:34 GMT
Status: Translation complete
Last updated in system: 2026-02-25 17:34:53.729352
Title: Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs
Title (reference translation): Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs
Authors: Guangsheng Ou, Qiming Zhang, Sirong Chen, Anji Li, Dong Xu, Tiancheng Luo, Dekun Dai, Cuiyun Gao, Long Wang, Jun Zhou, Mingwei Liu, Zibin Zheng
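Abstract summary: Large language models (LLMs) often perform poorly and hallucinate frequently on unseen codebases. This work proposes a two-stage training framework for reasoning-aware data synthesis grounded in a code graph constructed from unseen codebases. It also introduces UnseenCodeBench, a new benchmark for code generation on unseen codebases.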
Reference score (attention level computed by this site): 42.60617835497159
License: http://creativecommons.org/licenses/by/4.0/
Abstract: For newly released software frameworks, large language models (LLMs) often exhibit poor performance and a high rate of hallucination, as they are not exposed to such environments during training. Although inference-time augmentation techniques such as retrieval-augmented generation (RAG) can partially mitigate hallucinations, knowledge injection through prompting alone is insufficient to enable models to fully understand the intrinsic relationships among different components of a codebase, or to reason about how to compose and apply them correctly. Explicit knowledge injection can be achieved through post-training; however, compared with public code domains, unseen codebases typically provide only source code and lack large volumes of high-quality, usage-oriented code that can be directly leveraged as training data. Consequently, existing data synthesis approaches cannot adequately capture the usage scenarios of unseen codebases when restricted to source code alone. To address these challenges, we propose UCD-Training, a two-stage training framework for reasoning-aware data synthesis grounded in a code graph constructed from unseen codebases. UCD-Training first parses the source code to build a code graph, then conducts dependency-preserving continued pretraining (CPT) using file-level dependency data, followed by graph-grounded supervised fine-tuning (SFT) on three types of synthesized data augmented with explicit reasoning traces: (1) single-hop relation reasoning data, (2) compositional API reasoning data, and (3) codebase utilization data. We further introduce a new benchmark, UnseenCodeBench, for code generation on unseen codebases and conduct comprehensive experiments across multiple codebases.
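Abstract (reference translation): On newly released software frameworks, large language models (LLMs) often perform poorly and hallucinate frequently because they were not exposed to such environments during training. Inference-time augmentation techniques such as retrieval-augmented generation (RAG) can partially mitigate hallucination, but knowledge injection through prompting alone does not let a model fully understand the intrinsic relationships among the components of a codebase, or reason about how to compose and apply them correctly. Explicit knowledge injection is possible through post-training, but unlike public code domains, unseen codebases typically provide only source code and lack large volumes of high-quality, usage-oriented code that can serve directly as training data. As a result, existing data synthesis approaches fail to adequately capture the usage scenarios of unseen codebases when restricted to source code alone. To address these challenges, the authors propose UCD-Training, a two-stage training framework for reasoning-aware data synthesis grounded in a code graph constructed from unseen codebases. UCD-Training first parses the source code to build a code graph, then performs dependency-preserving continued pretraining (CPT) on file-level dependency data, followed by graph-grounded supervised fine-tuning (SFT) on three types of synthesized data augmented with explicit reasoning traces: (1) single-hop relation reasoning data, (2) compositional API reasoning data, and (3) codebase utilization data. The authors also introduce UnseenCodeBench, a new benchmark for code generation on unseen codebases, and conduct comprehensive experiments across multiple codebases.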
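The first steps named in the abstract (parsing source code into a code graph, then arranging file-level data so that dependencies are preserved) can be illustrated with a small sketch. This is not the authors' implementation, which the abstract does not detail. It assumes a Python codebase, treats `import` statements as the only edge type, and uses hypothetical function names (`file_imports`, `build_file_graph`, `dependency_order`, `single_hop_relations`) chosen only for illustration.

```python
import ast
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def file_imports(source: str) -> set[str]:
    """Return the top-level module names imported by one Python source file."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports are ignored in this sketch.
            modules.add(node.module.split(".")[0])
    return modules


def build_file_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the in-repository modules it imports.

    `files` maps a module name to its source text (an assumed input format).
    Imports of modules outside `files` (e.g. the standard library) are dropped.
    """
    return {
        name: file_imports(src) & (files.keys() - {name})
        for name, src in files.items()
    }


def dependency_order(graph: dict[str, set[str]]) -> list[str]:
    """Order modules so that every module appears after the modules it imports."""
    return list(TopologicalSorter(graph).static_order())


def single_hop_relations(graph: dict[str, set[str]]) -> list[tuple[str, str, str]]:
    """List (source, relation, target) triples, one per edge of the graph."""
    return sorted((src, "imports", dst) for src, deps in graph.items() for dst in deps)


# Toy repository used only for demonstration.
files = {
    "app": "import service\nfrom utils import helper\n",
    "service": "import utils\nimport os\n",
    "utils": "import json\n",
}
graph = build_file_graph(files)
print(dependency_order(graph))      # dependencies come before their dependents
print(single_hop_relations(graph))  # one triple per import edge
```

Concatenating file contents in the order returned by `dependency_order` is one plausible way to form dependency-preserving training sequences, and the triples from `single_hop_relations` are one plausible seed for single-hop relation questions. Both are assumptions made for this sketch.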

Related papers

CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback [21.627909324788597]
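Acquiring high-quality instruction-code pairs is essential for training large language models. We propose CodeEvo, a framework that synthesizes code data through iterative interaction between two LLM agents.
Paper reference translation (metadata) (2025-07-25T16:12:51Z)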
Towards A Generalist Code Embedding Model Based On Massive Data Synthesis [35.04242699869519]
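We introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. CodeR's strong performance rests on CodeR-Pile, a large-scale synthetic dataset built according to the DRU principle.
Paper reference translation (metadata) (2025-05-19T04:37:53Z)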
Is Compression Really Linear with Code Intelligence? [60.123628177110206]
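Format Annealing is a lightweight, transparent training method designed to assess the intrinsic capabilities of pretrained models on an equal footing. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits per character (BPC). Our work offers a more nuanced understanding of the role of compression in developing code intelligence and contributes a robust evaluation framework for the code domain.
Paper reference translation (metadata) (2025-05-16T16:59:14Z)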
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
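Large language models (LLMs) have shown remarkable capabilities across a variety of tasks, but code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline that uses model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that uses model-generated unit tests to guide the synthesis of high-quality code data from pretraining corpora.
Paper reference translation (metadata) (2025-02-17T05:37:02Z)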
EpiCoder: Encompassing Diversity and Complexity in Code Generation [66.43738008739555]
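Existing code generation methods use code snippets as seed data. We propose a novel feature-tree-based synthesis framework that revolves around hierarchical code features. Our framework gives precise control over the complexity of generated code and covers a wide range of functionality, from function-level operations to multi-file scenarios.
Paper reference translation (metadata) (2025-01-08T18:58:15Z)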
Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study [20.812886172494082]
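This study investigates the effectiveness of data augmentation methods designed for natural language text. The results identify specific data augmentation methods that yield more accurate and robust source code learning models.
Paper reference translation (metadata) (2023-03-13T01:47:05Z)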
Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
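A soft-labeled contrastive pre-training framework with two positive sample construction methods. By taking into account the relevance between code samples in a large-scale code corpus, soft-labeled contrastive pre-training can obtain fine-grained soft labels. SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
Paper reference translation (metadata) (2022-10-18T05:17:37Z)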
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
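We propose a new approach to code search that uses multimodal contrastive learning and soft data augmentation. We conduct extensive experiments to evaluate the effectiveness of the approach on a large-scale dataset covering six programming languages.
Paper reference translation (metadata) (2022-04-07T08:49:27Z)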
deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search [15.19181807445119]
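We propose a learnable DeGraph for Code Search (called deGraphCS) that converts source code into variable-based flow graphs. We collect a large-scale dataset from GitHub containing 41,152 code snippets written in C.
Paper reference translation (metadata) (2021-03-24T06:57:44Z)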

The related papers list is generated automatically from the titles and abstracts of papers on this site.

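This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use.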