Fugu-MT Paper Translation (Abstract): Is Dimensionality a Barrier for Retrieval Models?

Paper abstract: Is Dimensionality a Barrier for Retrieval Models?

arxiv url: http://arxiv.org/abs/2605.23556v1
Date: Fri, 22 May 2026 12:22:20 GMT
Status: Translation complete
System update date: 2026-05-25 17:29:20.34056
Title: Is Dimensionality a Barrier for Retrieval Models?
Title (reference translation): Is dimensionality a barrier for retrieval models?
Authors: Kiril Bangachev, Guy Bresler, Jonathan Kogan, Yury Polyanskiy,
Abstract summary: Our main theorem shows that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A)$, can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$.
Reference score (independently computed attention score): 21.1705493494434
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m>0,$ denoted by $\mathsf{m}^{\mathsf{rd}}(d, A),$ for which there exist unit norm embeddings of the queries and documents $\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$ in dimension $d$ with the following property: $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A),$ can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n),$ which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k)).$ Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and demonstrate a clear advantage of the sigmoid loss.
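Abstract (reference translation): Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To address this, we study maximal-margin embeddings in the following retrieval model, which was studied classically in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We consider the largest margin $m>0,$ denoted by $\mathsf{m}^{\mathsf{rd}}(d, A),$ for which there exist unit norm embeddings of the queries and documents $\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$ such that $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem proves that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A),$ can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n),$ which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing every possible $k$-sparse row exactly once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k)).$ Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and show a clear advantage of the sigmoid loss.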
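
The margin condition in the abstract can be checked numerically for any candidate embeddings. The following is a minimal Python sketch, not code from the paper: the helper names `margin` and `normalize` and the toy matrix are assumptions made here for illustration. It computes the largest $m$ for which fixed unit-norm embeddings satisfy $\langle U_j, V_i\rangle \ge m$ when $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. It only evaluates given embeddings and does not search for the optimal ones that define $\mathsf{m}^{\mathsf{rd}}(d, A)$.

```python
import math


def normalize(x):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def margin(A, U, V):
    """Margin achieved by fixed embeddings U (queries) and V (documents).

    Returns the minimum over all pairs (j, i) of s * <U_j, V_i>, where
    s = +1 if A[j][i] == 1 and s = -1 otherwise. A positive return value m
    means <U_j, V_i> >= m on relevant pairs and <= -m on irrelevant pairs.
    A value <= 0 means these embeddings do not realize A with any margin.
    """
    best = math.inf
    for j, u in enumerate(U):
        for i, v in enumerate(V):
            dot = sum(a * b for a, b in zip(u, v))
            sign = 1.0 if A[j][i] == 1 else -1.0
            best = min(best, sign * dot)
    return best


# Hypothetical toy instance: 2 queries, 2 documents,
# each query relevant to exactly one document.
A = [[1, 0],
     [0, 1]]
U = [normalize([1.0, -1.0]), normalize([-1.0, 1.0])]
V = [normalize([1.0, -1.0]), normalize([-1.0, 1.0])]

print(margin(A, U, V))
```

In this toy instance each query embedding coincides with its relevant document and is antipodal to the other, so the achieved margin is 1 up to floating-point rounding, the largest value unit vectors allow. Changing `A` so that one query is relevant to both documents makes the same embeddings return a negative value, signalling that they no longer separate relevant from irrelevant pairs.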

Related papers