Fugu-MT Paper Translation (Summary): LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

Paper summary: LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

arxiv url: http://arxiv.org/abs/2509.12539v1
Date: Tue, 16 Sep 2025 00:41:05 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:52.823982
Title: LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
Title (reference translation): LEAF: Knowledge Distillation of Text Embedding Models Using Teacher-Aligned Representations
Authors: Robin Vujanic, Thomas Rueckstiess
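Abstract summary: We propose LEAF (Lightweight Embedding Alignment Framework), a knowledge distillation framework for text embedding models. A key feature is that our distilled leaf models are aligned to their teacher. We show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model.
Reference score (independently calculated attention score): 2.191505742658975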
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures in which documents are encoded with the larger teacher model, while queries are served with the smaller leaf models. We also show that leaf models automatically inherit MRL (Matryoshka Representation Learning) and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework, we publish leaf-ir, a 23M-parameter, information-retrieval-oriented text embedding model trained using LEAF, which sets a new state of the art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark among models of its size. When run in asymmetric mode, its retrieval performance increases further. Our scheme is, however, not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This model also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and, in contrast to other embedding model training frameworks, it requires neither judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, the dataset and training infrastructure requirements of our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.
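Abstract (reference translation): We propose LEAF (Lightweight Embedding Alignment Framework), a knowledge distillation framework for text embedding models. A key feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this enables flexible asymmetric architectures that encode documents with the larger teacher model and serve queries with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model. leaf-ir, a 23M-parameter information-retrieval-oriented text embedding model trained using LEAF, sets a new state of the art (SOTA) on BEIR, ranking first on the public leaderboard for this benchmark among models of its size. When run in asymmetric mode, its retrieval performance improves further. Our method is not limited to the information retrieval setting, however, and we demonstrate its applicability by synthesizing the multi-task leaf-mt model. This model also sets a new SOTA, ranking first on the MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and, in contrast to other embedding model training frameworks, requires neither judgments nor hard negatives, and training can be carried out with small batch sizes. The dataset and training infrastructure requirements of our framework are therefore modest. We make our models publicly available under a permissive Apache 2.0 license.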
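
The asymmetric mode described in the abstract can be illustrated with a small toy sketch. This is not the authors' code: `teacher_encode` and `leaf_encode` are hypothetical stand-ins that simulate an aligned teacher/student pair with random vectors, and the noise levels, dimensions, and int8 scheme are assumptions chosen only to show the mechanics. Documents are indexed with the "teacher", the query is embedded with the "leaf" model, and the same search is repeated after MRL-style truncation and after int8 output quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # toy embedding size (assumption, not from the paper)

def normalize(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def teacher_encode(latent):
    """Hypothetical stand-in for the large teacher model."""
    return normalize(latent)

def leaf_encode(latent, noise=0.05):
    """Hypothetical stand-in for an aligned leaf model:
    the teacher's output plus a small approximation error."""
    return normalize(latent + noise * rng.normal(size=latent.shape))

def truncate(x, k):
    """MRL-style truncation: keep the first k dimensions, renormalize."""
    return normalize(x[..., :k])

def quantize_int8(x):
    """Simple symmetric int8 quantization of unit-norm vectors."""
    return np.clip(np.round(x * 127), -127, 127).astype(np.int8)

def search(query_vec, doc_matrix):
    """Return the index of the best-scoring document (dot product)."""
    scores = doc_matrix.astype(np.float32) @ query_vec.astype(np.float32)
    return int(np.argmax(scores))

# Five toy "documents"; the query is a perturbed copy of document 3.
doc_latent = rng.normal(size=(5, DIM))
query_latent = doc_latent[3] + 0.1 * rng.normal(size=DIM)

doc_index = teacher_encode(doc_latent)  # documents: larger teacher model
query_vec = leaf_encode(query_latent)   # query: smaller leaf model

top_full = search(query_vec, doc_index)
top_mrl = search(truncate(query_vec, 32), truncate(doc_index, 32))
top_int8 = search(quantize_int8(query_vec), quantize_int8(doc_index))
print(top_full, top_mrl, top_int8)
```

The point of the sketch is that teacher and leaf vectors live in the same space, so a single document index can be queried by either model. In this toy setup truncation and quantization work only because the vectors are random and well separated; with real models, as the abstract notes, those properties carry over only when the teacher already has them.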


Related papers