Fugu-MT Paper Translation (Abstract): Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector

Paper overview: Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector

arxiv url: http://arxiv.org/abs/2510.00671v1
Date: Wed, 01 Oct 2025 08:58:25 GMT
Status: Translation complete
System update date: 2025-10-03 16:59:20.477004
Title: Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector
Title (reference translation): Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector
Authors: Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates
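Abstract summary: Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching. MILCO is an LSR architecture that maps queries and documents from different languages into a shared English lexical space.
Reference score (independently computed attention measure): 25.65114670027799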
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions.
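Abstract (reference translation): Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. MILCO is an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training, providing representation transparency and effectiveness while mitigating semantic collapse. Because uncommon entities are often lost when projected into English, the authors propose a LexEcho head, which improves robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when mass-based pruning reduces document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly sized Qwen3-Embed 0.6B with 1024 dimensions.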
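The abstract mentions lexical matching over sparse representations and post-hoc, mass-based pruning, but does not define either in detail. The following is a minimal sketch under stated assumptions, not MILCO's implementation: it assumes a sparse representation is a mapping from vocabulary terms to non-negative weights, that matching is a dot product over shared terms, and that "mass-based pruning" means keeping the highest-weight terms until they cover a chosen fraction of the total weight. The names `mass_prune` and `score`, the `mass` parameter, and the toy vectors are all hypothetical.

```python
from typing import Dict

# Assumption: a sparse vector maps vocabulary terms to non-negative weights.
SparseVec = Dict[str, float]


def mass_prune(vec: SparseVec, mass: float = 0.9) -> SparseVec:
    """Keep the highest-weight terms until they cover `mass` of the total weight.

    This is one plausible reading of "mass-based pruning", not the paper's
    confirmed definition.
    """
    total = sum(vec.values())
    if total <= 0.0:
        return {}
    kept: SparseVec = {}
    covered = 0.0
    # Visit terms from heaviest to lightest.
    for term, weight in sorted(vec.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= mass * total:
            break
        kept[term] = weight
        covered += weight
    return kept


def score(query: SparseVec, doc: SparseVec) -> float:
    """Dot product over the terms that both sparse vectors share."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


# Toy data (hypothetical): a document vector with a few low-weight terms.
doc = {"retrieval": 2.0, "sparse": 1.5, "language": 1.0, "the": 0.3, "of": 0.2}
query = {"sparse": 1.0, "retrieval": 1.0}

pruned = mass_prune(doc, mass=0.6)
print(sorted(pruned))        # terms that survive pruning
print(score(query, doc))     # score against the full document vector
print(score(query, pruned))  # score against the pruned document vector
```

In this toy case the pruned vector keeps only the two heaviest terms, and the query score is unchanged because the query matches only those terms. In general, pruning trades some effectiveness for smaller document representations.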

Related papers