PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
- URL: http://arxiv.org/abs/2603.04598v1
- Date: Wed, 04 Mar 2026 20:55:30 GMT
- Title: PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
- Authors: Rohan Mahadev, Joyce Yuan, Patrick Poirson, David Xue, Hao-Yu Wu, Dmitry Kislyuk
- Abstract summary: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments.
- Score: 3.889218009169166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false-positive avoidance, robustness, and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across four major paradigms, we uncover three significant drawbacks: the best method, while achieving an mAP@10 of 28.5%, still retrieves irrelevant results (hard negatives) 9% of the time; the best models also exhibit 25.1% performance variation across paraphrases, indicating significant headroom for improving current CIR techniques; and multi-image queries perform 40 to 70% worse across different methods. To address these issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to close the gap. We release the complete dataset, including all images, queries, annotations, the retrieval index, and benchmarking code.
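Since the abstract reports mAP@10 together with a hard-negative retrieval rate over per-query relevance judgments, the minimal Python sketch below illustrates how such multi-ground-truth scores could be computed. The `runs`/`judgments` dictionary layout and the field names `relevant` and `hard_negative` are illustrative assumptions, not the benchmark's released data format or official evaluation code.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query that may have several correct answers."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)  # normalize by relevant items reachable within k
    return score / denom if denom else 0.0


def hard_negative_hit_at_k(ranked_ids, hard_negative_ids, k=10):
    """1.0 if any labeled hard negative appears in the top-k, else 0.0."""
    hard = set(hard_negative_ids)
    return 1.0 if any(doc_id in hard for doc_id in ranked_ids[:k]) else 0.0


def evaluate(runs, judgments, k=10):
    """runs: {query_id: ranked list of candidate ids}
    judgments: {query_id: {"relevant": [...], "hard_negative": [...]}} (hypothetical layout)."""
    ap_scores, hn_hits = [], []
    for qid, ranked in runs.items():
        judgment = judgments[qid]
        ap_scores.append(average_precision_at_k(ranked, judgment["relevant"], k))
        hn_hits.append(hard_negative_hit_at_k(ranked, judgment.get("hard_negative", []), k))
    return {
        f"mAP@{k}": sum(ap_scores) / len(ap_scores),
        f"hard_negative_rate@{k}": sum(hn_hits) / len(hn_hits),
    }


if __name__ == "__main__":
    # Toy example: one query with two relevant answers and one hard negative.
    runs = {"q1": ["img_3", "img_7", "img_9", "img_1"]}
    judgments = {"q1": {"relevant": ["img_3", "img_1"], "hard_negative": ["img_7"]}}
    print(evaluate(runs, judgments, k=10))  # mAP@10 = 0.75, hard_negative_rate@10 = 1.0
```

Normalizing AP by min(|relevant|, k) is one common mAP@k convention for multi-ground-truth retrieval; the benchmark's released evaluation code may define the denominator differently.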
Related papers
- IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering [1.4427879901952518]
We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. We analyze the limitations of unimodal text and image representations and identify question types that require one modality over the other.
arXiv Detail & Related papers (2026-02-05T21:57:43Z) - FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text [3.6723140587841656]
Image-Guided Retrieval with Optional Text (IGROT) unifies visual retrieval (without text) and composed retrieval (with text). We introduce FIGROTD, a lightweight yet high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets. Trained on FIGROTD, VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy.
arXiv Detail & Related papers (2025-11-27T09:18:56Z) - HuggingR$^{4}$: A Progressive Reasoning Framework for Discovering Optimal Model Companions [50.61510609116118]
HuggingR$^{4}$ is a novel framework that combines Reasoning, Retrieval, Refinement, and Reflection to efficiently select models. It attains a workability rate of 92.03% and a reasonability rate of 82.46%, surpassing existing methods by 26.51% and 33.25%, respectively.
arXiv Detail & Related papers (2025-11-24T03:13:45Z) - MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Chain-of-Thought Re-ranking for Image Retrieval Tasks [16.13448876168839]
We propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address image retrieval. By allowing the MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making. Our method achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR), and chat-based image retrieval (Chat-IR).
arXiv Detail & Related papers (2025-09-18T08:48:46Z) - ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge [50.93758649363798]
ImpliRet is a benchmark that shifts the reasoning challenge to document-side processing. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting.
arXiv Detail & Related papers (2025-06-17T11:08:29Z) - Zero Shot Composed Image Retrieval [0.0]
Composed image retrieval (CIR) allows a user to locate a target image by applying a fine-grained textual edit. Zero-shot CIR, which embeds the image and the text with separate pretrained vision-language encoders, reaches only 20-25% Recall@10 on the FashionIQ benchmark. We improve this by fine-tuning BLIP-2 with a lightweight Q-Former that fuses visual and textual features into a single embedding.
arXiv Detail & Related papers (2025-06-07T00:38:43Z) - LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
LENS is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, as well as two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z) - Efficient and Discriminative Image Feature Extraction for Universal Image Retrieval [1.907072234794597]
We develop a framework for a universal feature extractor that provides strong semantic image representations across various domains.
We achieve near state-of-the-art results on the Google Universal Image Embedding Challenge, with an mMP@5 of 0.721.
Compared to methods with similar computational requirements, we outperform the previous state of the art by 3.3 percentage points.
arXiv Detail & Related papers (2024-09-20T13:53:13Z) - Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering. We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z) - Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs [61.01278660925202]
Dysca is a dynamic and scalable benchmark for evaluating LVLMs by leveraging synthesized images. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Dysca serves as a scalable benchmark to which new subtasks and scenarios can easily be added.
arXiv Detail & Related papers (2024-06-27T02:40:35Z)