Fugu-MT Paper Translation (Abstract): Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings

Paper abstract: Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings

arxiv url: http://arxiv.org/abs/2509.12892v1
Date: Tue, 16 Sep 2025 09:48:11 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:53.013255
Title: Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
Authors: Shiyu Li, Yang Tang, Ruijie Liu, Shi-Zhe Chen, Xi Chen
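Abstract summary: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).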
Reference score (independently computed attention score): 25.724646707322986
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach that is limited by the data and training gaps between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
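
The abstract names a soft-masking mechanism that moves gradually from a causal mask to a bidirectional mask, but it does not give the formula. As a rough illustration only, the sketch below assumes one simple way such a transition could work: future positions get an attention weight `alpha` that ramps linearly from 0 (causal) to 1 (bidirectional) over training. The names `soft_attention_mask`, `linear_alpha`, the parameter `alpha`, and the linear schedule are all hypothetical and are not taken from the paper.

```python
from typing import List


def soft_attention_mask(seq_len: int, alpha: float) -> List[List[float]]:
    """Build a seq_len x seq_len matrix of attention weights in [0, 1].

    Entry [i][j] scales how much query position i may attend to key
    position j. Past and current positions (j <= i) are always fully
    visible. Future positions (j > i) are visible with weight alpha.
    ASSUMPTION: this interpolation is illustrative, not the paper's formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [
        [1.0 if j <= i else alpha for j in range(seq_len)]
        for i in range(seq_len)
    ]


def linear_alpha(step: int, total_steps: int) -> float:
    """Hypothetical schedule: ramp alpha linearly from 0 to 1, then hold at 1."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


# Show the mask at the start, midpoint, and end of a 100-step schedule.
for step in (0, 50, 100):
    alpha = linear_alpha(step, 100)
    print(f"step={step} alpha={alpha}")
    for row in soft_attention_mask(3, alpha):
        print("  ", row)
```

With `alpha = 0` the matrix is the lower-triangular causal mask used in LLM pretraining, and with `alpha = 1` every position sees every other position, as in a bidirectional embedding model. Intermediate values give the gradual transition the abstract describes, under the assumptions stated above.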

Related papers