Fugu-MT Paper Translation (Abstract): Multi-Vector Embeddings are Provably More Expressive than Single Vector Embeddings

Paper abstract: Multi-Vector Embeddings are Provably More Expressive than Single Vector Embeddings

arxiv url: http://arxiv.org/abs/2606.23475v1
Date: Mon, 22 Jun 2026 15:22:05 GMT
Status: Translation complete
Last updated in system: 2026-06-24 18:51:26.429231
Title: Multi-Vector Embeddings are Provably More Expressive than Single Vector Embeddings
Title (reference translation): Multi-vector embeddings are provably more expressive than single-vector embeddings
Authors: Rajesh Jayaram
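Abstract summary: The paper shows that MV embeddings can express similarities that single-vector embeddings cannot represent even approximately. At a fixed representation size, multi-vector embeddings are therefore strictly more expressive than single-vector embeddings.
Reference score (independently computed attention score): 5.90432887297327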
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-vector (MV) embeddings have become a powerful paradigm in neural information retrieval (IR), achieving high retrieval accuracy by representing data with multiple vectors and scoring them via the non-linear Chamfer similarity. Despite their widely perceived superiority over single-vector (SV) embeddings which use inner product similarity, to date there is no formal proof that SV similarities cannot approximate MV similarities with the same representation size. Specifically, we ask the following: for any bounded dataset size $n \leq 2^{poly(m)}$, what is the smallest dimension $D$ so that given any collection of MV embeddings $Q_1,\dots,Q_n,X_1,\dots,X_n \subset \mathbb{R}^d$ containing at most $m$ vectors each, there always exist $q_1,\dots,q_n$, $d_1,\dots,d_n \in \mathbb{R}^{D}$ satisfying $|\langle q_i, d_j \rangle - \texttt{Chamfer}(Q_i,X_j)| \leq ε$ for all $i,j$? Recently, the MUVERA algorithm demonstrated that $D = m^{O(1/ε^2)}$ is possible. If improved to $D = md$, this would imply that MV embeddings are no more expressive than SV embeddings. In this paper, we rule out this scenario. Specifically, we prove the existence of a collection of MV embeddings in $\mathbb{R}^d$, each containing at most $m$ vectors, which require single-vector dimension of $D =(ε^2 m)^{Ω(1/ε)}$ to approximate, establishing a strong separation in representation size between MV and SV embeddings. Our proof leverages the Pattern Matrix Method by constructing a hard instance whose Chamfer similarity matrix encodes the $NAND_k$ boolean function. Our results confirm a long-held belief in the IR community: at a fixed representation size, multi-vector embeddings can express similarities which cannot even be approximately represented by single vector embeddings.
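Abstract (reference translation): Multi-vector (MV) embeddings have become a powerful paradigm in neural information retrieval (IR), achieving high retrieval accuracy by representing data with multiple vectors and scoring them via the non-linear Chamfer similarity. Although they are widely perceived as superior to single-vector (SV) embeddings, which use inner product similarity, there has so far been no formal proof that SV similarities cannot approximate MV similarities at the same representation size. Specifically, we ask: for any bounded dataset size $n \leq 2^{poly(m)}$, what is the smallest dimension $D$ such that, for any collection of MV embeddings $Q_1,\dots,Q_n,X_1,\dots,X_n \subset \mathbb{R}^d$ containing at most $m$ vectors each, there always exist $q_1,\dots,q_n$, $d_1,\dots,d_n \in \mathbb{R}^{D}$ satisfying $|\langle q_i, d_j \rangle - \texttt{Chamfer}(Q_i,X_j)| \leq ε$ for all $i,j$? Recently, the MUVERA algorithm showed that $D = m^{O(1/ε^2)}$ is achievable. If this could be improved to $D = md$, it would mean that MV embeddings are no more expressive than SV embeddings. This paper rules out that scenario. Specifically, we prove that there exists a collection of MV embeddings in $\mathbb{R}^d$, each containing at most $m$ vectors, that requires a single-vector dimension of $D = (ε^2 m)^{Ω(1/ε)}$ to approximate, establishing a strong separation in representation size between MV and SV embeddings. The proof leverages the Pattern Matrix Method by constructing a hard instance whose Chamfer similarity matrix encodes the $NAND_k$ boolean function. The results confirm a long-held belief in the IR community: at a fixed representation size, multi-vector embeddings can express similarities that single-vector embeddings cannot represent even approximately.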
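
The approximation condition in the abstract can be illustrated with a small sketch. The abstract does not spell out the Chamfer formula, so the code below assumes the common unnormalized definition (for each query vector, take the largest inner product with any document vector, then sum); the paper may use a normalized variant. The helper names `dot`, `chamfer`, `sum_pool`, and `max_approx_error` are hypothetical, and `sum_pool` is only a naive single-vector baseline chosen for illustration, not the MUVERA construction or anything from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def chamfer(Q: List[Vector], X: List[Vector]) -> float:
    """Assumed (unnormalized) Chamfer similarity:
    sum over q in Q of the max over x in X of <q, x>."""
    return sum(max(dot(q, x) for x in X) for q in Q)


def sum_pool(vectors: List[Vector]) -> List[float]:
    """Naive single-vector baseline (illustrative only):
    add the vectors coordinate-wise."""
    return [sum(col) for col in zip(*vectors)]


def max_approx_error(
    queries: List[List[Vector]],
    docs: List[List[Vector]],
    sv_queries: List[Vector],
    sv_docs: List[Vector],
) -> float:
    """Largest |<q_i, d_j> - Chamfer(Q_i, X_j)| over all pairs (i, j)."""
    return max(
        abs(dot(sv_queries[i], sv_docs[j]) - chamfer(queries[i], docs[j]))
        for i in range(len(queries))
        for j in range(len(docs))
    )


# Toy instance: one query set and one document set in R^2.
Q = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.5, 0.5]]

mv_score = chamfer(Q, X)                   # max-then-sum over query vectors
sv_score = dot(sum_pool(Q), sum_pool(X))   # sums over all pairs instead
error = max_approx_error([Q], [X], [sum_pool(Q)], [sum_pool(X)])
```

On this toy instance the pooled inner product counts every query-document pair, while Chamfer keeps only the best match per query vector, so the two scores differ. The paper's result concerns how large $D$ must be before any choice of single vectors can bring this error below $ε$ for all pairs.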

Related papers