Fugu-MT Paper Translation (Summary): Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Paper summary: Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

arxiv url: http://arxiv.org/abs/2605.14787v2
Date: Fri, 15 May 2026 21:26:27 GMT
Status: Translation complete
Last updated in system: 2026-05-19 23:51:08.282349
Title: Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
Authors: Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan, Wai-Chung Kwan, Claudio Pomo, Alessandro Suglia, Dietmar Jannach, Tommaso Di Noia, Pasquale Minervini
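Abstract summary: Composed Image Retrieval benchmarks are assumed to require multimodal composition. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality. High CIR performance can therefore arise from unimodal signals rather than true multimodal composition.
Reference score (attention score computed independently by the site): 86.99911649534795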
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
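The first audit stage described in the abstract (identifying shortcut-solvable queries through cross-model analysis) can be illustrated with a minimal sketch. The abstract does not state the exact criterion, so everything below is an assumption made for illustration: a query is treated as shortcut-solvable if, for at least one model, the image-only or the text-only query embedding retrieves the target within the top k by cosine similarity. The function names (`cosine`, `hits_at_k`, `is_shortcut_solvable`), the dictionary layout, and the toy vectors are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hits_at_k(query_vec, gallery, target_idx, k=1):
    """Return True if the target is among the k gallery items most similar to the query."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query_vec, gallery[i]),
        reverse=True,
    )
    return target_idx in ranked[:k]


def is_shortcut_solvable(per_model, target_idx, k=1):
    """Assumed criterion: some model finds the target from a single modality alone.

    per_model is a list of dicts, one per embedding model, with keys
    'image' (reference-image-only query embedding), 'text' (modification-text-only
    query embedding) and 'gallery' (candidate image embeddings from that model).
    """
    for m in per_model:
        image_only = hits_at_k(m["image"], m["gallery"], target_idx, k)
        text_only = hits_at_k(m["text"], m["gallery"], target_idx, k)
        if image_only or text_only:
            return True
    return False


# Toy data for a single hypothetical model with 2-dimensional embeddings.
toy_model = {
    "image": [1.0, 0.1],
    "text": [0.1, 1.0],
    "gallery": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
```

With the toy data, gallery item 2 is not the top match for either unimodal query, so `is_shortcut_solvable([toy_model], target_idx=2)` returns `False`, whereas item 1 is the top match for the text-only query and is flagged as shortcut-solvable. Queries that pass this filter would then go to the second stage, human validation, which code cannot replace.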


Related papers