Fugu-MT Paper Translation (Abstract): LLM2Vec-Gen: Generative Embeddings from Large Language Models

Paper summary: LLM2Vec-Gen: Generative Embeddings from Large Language Models

arxiv url: http://arxiv.org/abs/2603.10913v1
Date: Wed, 11 Mar 2026 15:58:47 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:33.03992
Title: LLM2Vec-Gen: Generative Embeddings from Large Language Models
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy
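Abstract summary: The paper proposes a novel self-supervised approach for training embedding models. Rather than encoding the input, the model learns to represent its own potential response. The authors observe up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities for embedding tasks.
Reference score (independently calculated attention level): 38.742293185880364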
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
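Illustrative sketch (not from the paper): the training setup described in the abstract can be pictured with a toy, pure-Python example. It covers only the distillation part of the idea, under loud assumptions: the frozen "backbone" is a fixed linear map `W` rather than an LLM, the teacher targets are random stand-in vectors rather than embeddings of real LLM completions, and all names (`embed`, `train`, `DIM`, `NUM_SPECIAL`) are hypothetical. The actual LLM2Vec-Gen objective and architecture may differ substantially.

```python
import random

DIM = 4          # toy hidden size (assumption)
NUM_SPECIAL = 2  # number of trainable special tokens (assumption)
rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Frozen "backbone": a fixed linear map standing in for the LLM (assumption).
W = [rand_vec(DIM) for _ in range(DIM)]

def mean_vecs(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def embed(query, special):
    """Append the special tokens to the query and mean-pool their outputs."""
    ctx = mean_vecs(query)
    outs = [[sum(W[i][j] * (ctx[j] + tok[j]) for j in range(DIM))
             for i in range(DIM)] for tok in special]
    return mean_vecs(outs)

def avg_loss(queries, targets, special):
    """Mean squared error between embeddings and teacher targets."""
    total = 0.0
    for q, t in zip(queries, targets):
        emb = embed(q, special)
        total += sum((e - x) ** 2 for e, x in zip(emb, t)) / DIM
    return total / len(queries)

def train(queries, targets, special, steps=200, lr=0.05):
    """Full-batch gradient descent on the special tokens only."""
    for _ in range(steps):
        grad = [0.0] * DIM
        for q, t in zip(queries, targets):
            emb = embed(q, special)
            g_emb = [2.0 * (e - x) / DIM for e, x in zip(emb, t)]
            for j in range(DIM):
                # Backprop through mean pooling and W (chain rule by hand).
                grad[j] += sum(W[i][j] * g_emb[i] for i in range(DIM))
        scale = lr / (len(queries) * len(special))
        for tok in special:
            for j in range(DIM):
                tok[j] -= scale * grad[j]

# Three unlabeled "queries", each a short sequence of frozen token vectors.
queries = [[rand_vec(DIM) for _ in range(3)] for _ in range(3)]
# Stand-in for the teacher's embedding of the LLM's own response to each query.
targets = [rand_vec(DIM) for _ in queries]
special = [[0.0] * DIM for _ in range(NUM_SPECIAL)]

frozen_copy = [row[:] for row in W]
before = avg_loss(queries, targets, special)
train(queries, targets, special)
after = avg_loss(queries, targets, special)
print(f"distillation loss before: {before:.4f}, after: {after:.4f}")
```

Only the `special` vectors are updated; `W` is never written to, which mirrors the abstract's statement that the backbone stays frozen and that training needs only unlabeled queries plus teacher-provided targets.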

