Fugu-MT Paper Translation (Abstract): Towards A Generalist Code Embedding Model Based On Massive Data Synthesis

Paper abstract: Towards A Generalist Code Embedding Model Based On Massive Data Synthesis

arxiv url: http://arxiv.org/abs/2505.12697v1
Date: Mon, 19 May 2025 04:37:53 GMT
Status: Translation complete
Last updated in system: 2025-05-20 14:57:11.406734
Title: Towards A Generalist Code Embedding Model Based On Massive Data Synthesis
Authors: Chaofan Li, Jianlyu Chen, Yingxia Shao, Defu Lian, Zheng Liu
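Abstract summary: We introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. CodeR's superior performance is built on CodeR-Pile, a large-scale synthetic dataset constructed under the DRU principle.
Reference score (independently computed attention score): 35.04242699869519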
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce CodeR (Code Retrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous sources of data. We evaluate CodeR on 16 diverse code retrieval tasks, where it significantly outperforms existing baselines and exhibits strong out-of-domain generalization performance. We have publicly released our code and the well-trained model to facilitate further research in this critical area: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Coder
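As a rough illustration of the retrieval setting the abstract describes, the sketch below ranks code snippets against a natural-language query by cosine similarity between embeddings. It is not CodeR's implementation or API. The `embed` function is a hypothetical bag-of-tokens stand-in for a learned embedding model, and `CORPUS`, `cosine`, and `retrieve` are illustrative names that do not come from the paper.

```python
import math
import re
from collections import Counter

# Illustrative corpus of code snippets (hypothetical, not from the paper).
CORPUS = [
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
    "def add_numbers(a, b): return a + b",
]


def embed(text):
    """Toy stand-in for an embedding model: sparse bag-of-tokens counts.

    A real code embedding model would return a dense vector instead.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, top_k=1):
    """Return the top_k snippets most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(corpus, key=lambda s: cosine(query_vec, embed(s)), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    print(retrieve("sort a list of items", CORPUS))
```

Swapping `embed` for a trained model would leave the ranking logic in `retrieve` unchanged; only the vector representation differs.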
