Fugu-MT Paper Translation (Abstract): On Strengths and Limitations of Single-Vector Embeddings

Paper summary: On Strengths and Limitations of Single-Vector Embeddings

arxiv url: http://arxiv.org/abs/2603.29519v1
Date: Tue, 31 Mar 2026 10:04:54 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.478208
Title: On Strengths and Limitations of Single-Vector Embeddings
Authors: Archish S, Mihir Agarwal, Ankit Garg, Neeraj Kayal, Kirankumar Shiragur,
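Abstract summary: The paper shows that dimensionality alone cannot explain the observed failures. It finds that domain shift and misalignment between embedding similarities and the task's underlying notion of relevance are major contributors.
Reference score (independently computed attention score): 13.712240635014775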
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent work (Weller et al., 2025) introduced a naturalistic dataset called LIMIT and showed empirically that a wide range of popular single-vector embedding models suffer substantial drops in retrieval quality, raising concerns about the reliability of single-vector embeddings for retrieval. Although Weller et al. (2025) proposed limited dimensionality as the main factor contributing to this, we show that dimensionality alone cannot explain the observed failures. We observe from results in (Alon et al., 2016) that $(2k+1)$-dimensional vector embeddings suffice for top-$k$ retrieval. This result points to other drivers of poor performance. Controlling for tokenization artifacts and linguistic similarity between attributes yields only modest gains. In contrast, we find that domain shift and misalignment between embedding similarities and the task's underlying notion of relevance are major contributors; finetuning mitigates these effects and can improve recall substantially. Even with finetuning, however, single-vector models remain markedly weaker than multi-vector representations, pointing to fundamental limitations. Moreover, finetuning single-vector models on LIMIT-like datasets leads to catastrophic forgetting (performance on MSMARCO drops by more than 40%), whereas forgetting for multi-vector models is minimal. To better understand the performance gap between single-vector and multi-vector models, we study the "drowning in documents" paradox (Reimers & Gurevych, 2021; Jacob et al., 2025): as the corpus grows, relevant documents are increasingly "drowned out" because embedding similarities behave, in part, like noisy statistical proxies for relevance. Through experiments and mathematical calculations on toy mathematical models, we illustrate why single-vector models are more susceptible to drowning effects than multi-vector models.
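
The drowning effect described in the abstract can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the paper: each query has a single relevant document, every similarity score is a fixed signal (for the relevant document only) plus independent Gaussian noise, and the function name `recall_at_k` and its parameters are hypothetical. It is not the authors' toy model, and it does not compare single-vector with multi-vector scoring.

```python
import random


def recall_at_k(corpus_size, k=10, signal=2.0, noise_std=1.0, trials=200, seed=0):
    """Estimate how often the single relevant document ranks in the top k.

    Assumed toy model (not from the paper): the relevant document scores
    signal + noise, every irrelevant document scores pure noise.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        relevant_score = signal + rng.gauss(0.0, noise_std)
        # Count irrelevant documents whose noisy score beats the relevant one.
        better = sum(
            1
            for _ in range(corpus_size - 1)
            if rng.gauss(0.0, noise_std) > relevant_score
        )
        # The relevant document is retrieved if fewer than k documents outrank it.
        if better < k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (10, 100, 1000, 5000):
        print(n, recall_at_k(n))
```

Under this noise model the estimated recall falls as the corpus grows, because more irrelevant documents get a chance to outscore the relevant one by luck. The signal strength and noise level are arbitrary choices for illustration.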

Related papers