Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
- URL: http://arxiv.org/abs/2603.01082v1
- Date: Sun, 01 Mar 2026 12:53:47 GMT
- Title: Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
- Authors: Xuan Lu, Kangle Li, Haohang Huang, Rui Meng, Wenjun Zeng, Xiaoyu Shen
- Abstract summary: MCMR (Multi-Conditional Multimodal Retrieval) is a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. It spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities.
- Score: 27.493644447594367
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal that: (i) models exhibit distinct modality asymmetries; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at https://github.com/EIT-NLP/MCMR
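To make the task definition concrete, below is a minimal, hypothetical Python sketch of the condition-aware matching the abstract describes: a query decomposes into atomic conditions, a candidate counts as relevant only if it satisfies all of them, and a pointwise reranker orders candidates by how many conditions they verifiably meet. All names (`Candidate`, `decompose_query`, `check_condition`, `rerank`) and the keyword-matching heuristic are illustrative assumptions, not the authors' released code; in MCMR's setting the per-condition check would be an MLLM judgment over the candidate image and metadata.

```python
# Hypothetical sketch of multi-condition relevance and pointwise reranking.
# Names and the keyword heuristic are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class Candidate:
    image_caption: str  # stand-in for the candidate's visual content
    metadata: str       # long-form textual metadata


def decompose_query(query: str) -> list[str]:
    """Split a natural-language query into atomic conditions.
    A real system might use an LLM; here conditions are assumed
    comma-separated purely for illustration."""
    return [c.strip() for c in query.split(",") if c.strip()]


def check_condition(condition: str, candidate: Candidate) -> bool:
    """Pointwise verification of one condition against one candidate.
    A naive keyword check stands in for an MLLM yes/no judgment."""
    text = (candidate.image_caption + " " + candidate.metadata).lower()
    return all(tok in text for tok in condition.lower().split())


def is_relevant(query: str, candidate: Candidate) -> bool:
    """Under the benchmark's definition, a candidate is relevant only
    if it jointly satisfies every specified condition."""
    return all(check_condition(c, candidate) for c in decompose_query(query))


def rerank(query: str, candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates by the fraction of conditions satisfied, so
    fully consistent items rise to the top."""
    conds = decompose_query(query)

    def score(cand: Candidate) -> float:
        return sum(check_condition(c, cand) for c in conds) / len(conds)

    return sorted(candidates, key=score, reverse=True)


# Example: one visual constraint ("red boots") plus one textual one ("leather").
items = [
    Candidate("red leather ankle boots", "material: leather; heel: 3cm"),
    Candidate("red canvas sneakers", "material: canvas; sole: rubber"),
]
print(rerank("red boots, leather", items)[0].image_caption)
```

The pointwise design mirrors finding (iii) above: checking each query-candidate pair explicitly, rather than relying on a single global similarity score, is what lets a reranker enforce that all conditions hold.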
Related papers
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model [57.89395815934156]
Multi-Turn Contrastive Learning (MuCo) is a dialogue-inspired framework that revisits the multimodal embedding process. Experiments evaluate MuCo with a newly curated 5M multimodal multi-turn dataset (M3T).
arXiv Detail & Related papers (2026-02-06T05:18:33Z)
- Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation [47.714317480436215]
PREMIR is a simple framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings.
arXiv Detail & Related papers (2025-08-23T16:14:41Z)
- Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z)
- MultiConIR: Towards multi-condition Information Retrieval [38.864056667809095]
MultiConIR is a benchmark designed to evaluate retrieval and reranking models under nuanced multi-condition query scenarios. Most retrievers and rerankers exhibit severe performance degradation as query complexity increases. This work delves into the factors contributing to reranker performance deterioration and examines how condition positioning within queries affects similarity assessment.
arXiv Detail & Related papers (2025-03-11T05:02:03Z)
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model [9.224965304457708]
This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework. It incorporates image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. Experiments on real-world datasets and public knowledge-based VQA and safety benchmarks demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results.
arXiv Detail & Related papers (2024-11-19T07:16:48Z)
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs). We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks (a minimal bi-encoder retrieval sketch appears after this list). Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR.
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
- An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models [21.892975397847316]
We present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index.
One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities.
The system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques.
arXiv Detail & Related papers (2024-07-05T02:01:49Z)
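As referenced in the MM-Embed entry above, the following is a minimal bi-encoder retrieval sketch under stated assumptions: `encode` is a hypothetical stand-in for any multimodal embedding model (real checkpoints such as MM-Embed ship their own loading and encoding APIs), and retrieval is cosine similarity over unit-normalized vectors.

```python
# Minimal bi-encoder retrieval sketch. `encode` is a hypothetical
# stand-in, not a real model API; it returns random unit vectors
# purely so the pipeline runs end to end.
import numpy as np

rng = np.random.default_rng(0)


def encode(inputs: list[str]) -> np.ndarray:
    """Stand-in encoder: maps each input to a unit-norm vector.
    Replace with a real multimodal embedding model in practice."""
    vecs = rng.normal(size=(len(inputs), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


def retrieve(query: str, corpus: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Score every candidate by cosine similarity (dot product of
    unit vectors) and return the top-k (candidate, score) pairs."""
    q = encode([query])[0]
    c = encode(corpus)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return [(corpus[i], float(sims[i])) for i in top]


docs = ["leather boots", "canvas sneakers", "oak coffee table"]
print(retrieve("red leather boots", docs, k=2))
```

The bi-encoder design embeds queries and candidates independently, so corpus embeddings can be precomputed and indexed; the pointwise rerankers discussed above then only need to score the short top-k list.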