Fugu-MT Paper Translation (Summary): NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Paper summary: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

arxiv url: http://arxiv.org/abs/2603.12824v1
Date: Fri, 13 Mar 2026 09:24:23 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.023902
Title: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
Title (reference translation): NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao
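Abstract summary: Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. Documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths. Cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives. The authors identify cross-lingual transfer as the primary performance bottleneck and resolve it cheaply by augmenting the training data with machine-translated queries.
Reference score (attention score computed independently by this site): 8.720698253117837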
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency, at a total training cost under 13 GPU-hours.
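Abstract (reference translation): Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding. Documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. A frozen 2B VLM teacher indexes documents offline, while a distilled text-only student with as few as 69M parameters encodes queries at inference. The key design choice is the distillation objective. A systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets shows that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, the authors identify cross-lingual transfer as the primary performance bottleneck and resolve it cheaply by augmenting the training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1\% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32$\times$ fewer parameters and 50$\times$ lower CPU query latency.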
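
The abstract describes the distillation objective only in words, so the following is a minimal sketch of what pointwise cosine alignment and the asymmetric retrieval path could look like, not the authors' implementation. It assumes the loss is the mean of 1 - cos(student, teacher) over paired query embeddings; the function names, the toy vectors, and the `doc_index` dictionary are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(student_embs, teacher_embs):
    """Mean of (1 - cosine) over paired student/teacher query embeddings.

    Assumed form of the pointwise objective: only query embeddings are
    involved, so no document is processed during training.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)


def rank_documents(query_emb, doc_index):
    """Rank document ids by cosine similarity to a query embedding."""
    scored = [(doc_id, cosine(query_emb, emb)) for doc_id, emb in doc_index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): teacher query embeddings are cached before training.
teacher_query_embs = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical student outputs for the same two queries.
student_query_embs = [[2.0, 0.0], [1.0, 1.0]]
loss = cosine_alignment_loss(student_query_embs, teacher_query_embs)

# Documents are embedded offline by the teacher; queries by the student.
doc_index = {"page_a": [0.9, 0.1], "page_b": [0.1, 0.9]}
ranking = rank_documents(student_query_embs[0], doc_index)
```

Because the loss depends only on direction, the first student vector, which is a scaled copy of its teacher vector, contributes zero loss. In a real setup the student outputs would come from a trainable text encoder and the loss would be minimized by gradient descent; this sketch only evaluates the objective and the retrieval scoring.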
