Fugu-MT Paper Translation (Abstract): UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Paper overview: UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

arxiv url: http://arxiv.org/abs/2603.29897v1
Date: Sun, 08 Feb 2026 12:39:19 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:13.173467
Title: UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates
Authors: Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Shikui Tu, Lei Xu
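Abstract summary: A text reranker is intrinsically closer to text candidates than to image candidates, which leads to biased and suboptimal cross-modal ranking. The authors propose UniRank, a VLM-based reranking framework that scores and orders hybrid text-image candidates without modality conversion. Experiments on scientific literature retrieval and design patent search show that UniRank consistently outperforms state-of-the-art baselines.
Reference score (attention score computed independently by this site): 19.175171858134632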
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
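The abstract says the instruction-tuning stage maps label-token likelihoods to a unified scalar score, but it does not give the formula. The sketch below shows one common way such a mapping can work, offered as an illustration rather than the authors' method: apply a softmax over the logits of a small set of label tokens and take the expected label value. The label tokens ("yes"/"no"), their numeric values, the function names, and the logits are all assumptions made up for this example.

```python
import math

# Hypothetical label tokens and numeric values; the abstract does not name them.
LABEL_VALUES = {"yes": 1.0, "no": 0.0}


def scalar_score(label_logits):
    """Softmax over the label-token logits, then the expected label value."""
    peak = max(label_logits.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in label_logits.items()}
    total = sum(weights.values())
    return sum(LABEL_VALUES[tok] * w / total for tok, w in weights.items())


def rerank(candidates):
    """Sort (id, modality, label_logits) tuples by descending scalar score."""
    scored = [
        (cid, modality, scalar_score(logits))
        for cid, modality, logits in candidates
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Made-up logits standing in for a VLM's output at the label position.
candidates = [
    ("doc-1", "text", {"yes": 0.2, "no": 1.1}),
    ("fig-7", "image", {"yes": 2.0, "no": -0.5}),
    ("doc-3", "text", {"yes": 1.0, "no": 1.0}),
]
ranking = rerank(candidates)
for cid, modality, score in ranking:
    print(f"{cid} ({modality}): {score:.3f}")
```

Because text and image candidates pass through the same scoring function, their scores land on one scale and can be sorted together, which is the property the hybrid scoring interface relies on.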


Related papers