Fugu-MT Paper Translation (Summary): WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Paper summary: WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

arxiv url: http://arxiv.org/abs/2602.23029v1
Date: Thu, 26 Feb 2026 14:11:10 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.715513
Title: WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Title (reference translation): WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Zero-Shot Composed Image Retrieval
Authors: Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
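Abstract summary: ZS-CIR aims to retrieve target images given a multimodal query, without training on annotated triplets. We propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline.
Reference score (independently computed attention level): 36.577766022251446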
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
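The "retrieve-verify-refine" loop described in the abstract can be pictured with a small, self-contained Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the verifier scores candidates, how the fusion weights are computed, or when refinement stops. The names `retrieve_verify_refine`, `Query`, `top`, the confidence `threshold`, and the verifier-weighted fusion rule are all hypothetical, and the toy search, verifier, and reflection functions stand in for the caption/image generators and models a real system would call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]  # candidate image id -> similarity score


@dataclass(frozen=True)
class Query:
    reference_image: str          # stand-in for the reference image
    modification_text: str        # the requested change
    hints: Tuple[str, ...] = ()   # refinement suggestions gathered so far


def top(scores: Scores, k: int) -> List[str]:
    """Ids of the k highest-scoring candidates (ties broken by id)."""
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


def retrieve_verify_refine(
    query: Query,
    t2i_search: Callable[[Query], Scores],        # edited-caption path (assumed interface)
    i2i_search: Callable[[Query], Scores],        # edited-image path (assumed interface)
    verify: Callable[[Query, str], float],        # confidence in [0, 1] (assumed)
    reflect: Callable[[Query, List[str]], str],   # returns a refinement suggestion
    threshold: float = 0.5,                       # hypothetical confidence cut-off
    top_k: int = 5,
    max_rounds: int = 3,
) -> List[str]:
    ranked: List[str] = []
    for _ in range(max_rounds):
        # Wider Search: query both paths and pool their top-k candidates.
        t2i, i2i = t2i_search(query), i2i_search(query)
        t2i_top, i2i_top = top(t2i, top_k), top(i2i, top_k)
        pool = set(t2i_top) | set(i2i_top)
        if not pool:
            return ranked
        # Verify: score every pooled candidate against the query.
        conf = {c: verify(query, c) for c in pool}

        def path_weight(ids: List[str]) -> float:
            return sum(conf[c] for c in ids) / len(ids) if ids else 0.0

        # Adaptive Fusion (assumption): weight each path by the mean
        # verifier confidence of its own top-k results.
        w_t, w_i = path_weight(t2i_top), path_weight(i2i_top)
        total = (w_t + w_i) or 1.0
        fused = {c: (w_t * t2i.get(c, 0.0) + w_i * i2i.get(c, 0.0)) / total
                 for c in pool}
        ranked = sorted(pool, key=lambda c: (-fused[c], c))
        if max(conf.values()) >= threshold:
            return ranked  # retrieval judged reliable: keep the fused ranking
        # Deeper Thinking: uncertain retrieval, so refine the query and retry.
        hint = reflect(query, ranked)
        query = Query(query.reference_image, query.modification_text,
                      query.hints + (hint,))
    return ranked


# Toy components: the target "c" only surfaces after one refinement round.
def toy_t2i(q: Query) -> Scores:
    return {"c": 0.8, "a": 0.3} if q.hints else {"a": 0.9, "b": 0.2}

def toy_i2i(q: Query) -> Scores:
    return {"c": 0.6, "b": 0.4} if q.hints else {"b": 0.7, "a": 0.1}

def toy_verify(q: Query, candidate: str) -> float:
    return 0.9 if candidate == "c" else 0.1

def toy_reflect(q: Query, ranked: List[str]) -> str:
    return "focus on the requested change"

result = retrieve_verify_refine(
    Query("ref.jpg", "make the dog sit on grass"),
    toy_t2i, toy_i2i, toy_verify, toy_reflect, top_k=2,
)
print(result)
```

In the toy run the first round is rejected because no candidate reaches the threshold, and the second round returns the fused ranking with the verified target first. A real verifier would likely be a multimodal model call, which is why it is passed in as a function here.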
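Abstract (reference translation): Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (a reference image together with a modification text) without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality: either an edited caption for Text-to-Image retrieval (T2I) or an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To make effective use of their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I through a "retrieve-verify-refine" pipeline and explicitly models intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search, generating both edited captions and edited images for parallel retrieval to broaden the candidate pool. It then performs Adaptive Fusion with a verifier that assesses retrieval confidence, triggering refinement for uncertain retrievals and dynamically fusing the two paths for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection, guiding the next retrieval round toward Deeper Thinking. Extensive experiments show that WISER significantly outperforms previous methods on multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization across diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.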
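The abstract reports its gains as relative improvements in mAP@5 and Recall@1. As a rough guide to reading those figures, the sketch below computes per-query Recall@K, one common form of AP@K (precision summed at each hit, divided by min(K, number of ground-truth targets)), and a relative improvement. The function names and the example numbers are illustrative assumptions, not taken from the paper, and the benchmarks' official evaluation scripts may differ in detail.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """1.0 if at least one target appears in the top-k results, else 0.0."""
    return float(any(c in targets for c in ranked[:k]))


def ap_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """Average precision over the top-k, normalized by min(k, #targets)."""
    hits, precision_sum = 0, 0.0
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate in targets:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    denom = min(k, len(targets))
    return precision_sum / denom if denom else 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """(new - baseline) / baseline, e.g. 0.2 means a 20% relative gain."""
    return (new - baseline) / baseline


# Made-up ranking with two ground-truth targets at ranks 2 and 4.
ranked = ["x", "t1", "y", "t2", "z"]
targets = {"t1", "t2"}
print(recall_at_k(ranked, targets, 1))   # no target at rank 1
print(ap_at_k(ranked, targets, 5))       # (1/2 + 2/4) / 2
print(relative_improvement(12.0, 10.0))  # made-up scores, not from the paper
```

Averaging `ap_at_k` over all queries gives mAP@K, and averaging `recall_at_k` gives the dataset-level Recall@K. A relative improvement is measured against the baseline score, so it is not the same as a gain in absolute percentage points.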

Related papers