Fugu-MT Paper Translation (Abstract): LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval

Paper Abstract: LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval

arxiv url: http://arxiv.org/abs/2603.26683v1
Date: Tue, 10 Mar 2026 13:25:39 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:13.070595
Title: LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval
Authors: Seonok Kim
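Abstract summary: LITTA is a query-expansion-centric retrieval framework for evidence page retrieval. Given a user query, LITTA uses a large language model to generate complementary query variants. Candidates from the expanded queries are aggregated through reciprocal rank fusion to improve evidence coverage.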
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieving relevant evidence from visually rich documents such as textbooks, technical reports, and manuals is challenging due to long context, complex layouts, and weak lexical overlap between user questions and supporting pages. We propose LITTA, a query-expansion-centric retrieval framework for evidence page retrieval that improves multimodal document retrieval without retriever retraining. Given a user query, LITTA generates complementary query variants using a large language model and retrieves candidate pages for each variant using a frozen vision retriever with late-interaction scoring. Candidates from expanded queries are then aggregated through reciprocal rank fusion to improve evidence coverage and reduce sensitivity to any single phrasing. This simple test-time strategy significantly improves retrieval robustness while remaining compatible with existing multimodal embedding indices. We evaluate LITTA on visually grounded document retrieval tasks across three domains: computer science, pharmaceuticals, and industrial manuals. Multi-query retrieval consistently improves top-k accuracy, recall, and MRR compared to single-query retrieval, with particularly large gains in domains with high visual and semantic variability. Moreover, the accuracy-efficiency trade-off is directly controllable by the number of query variants, making LITTA practical for deployment under latency constraints. These results demonstrate that query expansion provides a simple yet effective mechanism for improving visually grounded multimodal retrieval.
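
The retrieval flow described in the abstract (late-interaction scoring per query variant, followed by reciprocal rank fusion) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the tiny in-memory index, the use of MaxSim as the late-interaction score, and the fusion constant k = 60 are all assumptions that the abstract does not specify. LLM-based generation of query variants is omitted, and each variant is assumed to be already embedded as a list of vectors.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction (MaxSim) score: for each query vector, take the best
    dot product against any page vector, then sum over the query vectors."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def retrieve(query_vecs: Sequence[Vector],
             index: Dict[str, Sequence[Vector]],
             top_k: int) -> List[str]:
    """Rank every page in a small in-memory index by MaxSim, best first.
    The index is never modified, mirroring a frozen retriever."""
    ranked = sorted(
        index,
        key=lambda page_id: maxsim_score(query_vecs, index[page_id]),
        reverse=True,
    )
    return ranked[:top_k]


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each page accumulates 1 / (k + rank), with rank
    starting at 1. k = 60 is a commonly used default, assumed here."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by page id for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


def multi_query_retrieve(variant_vecs: Sequence[Sequence[Vector]],
                         index: Dict[str, Sequence[Vector]],
                         top_k: int = 3,
                         k: int = 60) -> List[str]:
    """Retrieve once per query variant, then fuse the per-variant rankings."""
    rankings = [retrieve(query_vecs, index, top_k) for query_vecs in variant_vecs]
    return reciprocal_rank_fusion(rankings, k)[:top_k]


if __name__ == "__main__":
    # Hypothetical toy data: three pages and three query variants in 2-D.
    toy_index = {"p1": [(1.0, 0.0)], "p2": [(0.0, 1.0)], "p3": [(0.7, 0.7)]}
    toy_variants = [[(1.0, 0.0)], [(0.0, 1.0)], [(0.6, 0.8)]]
    print(multi_query_retrieve(toy_variants, toy_index))
```

In this sketch the number of variants passed to `multi_query_retrieve` determines how many retrieval passes are run, which is one way to picture the accuracy-efficiency trade-off the abstract mentions.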


Related Papers