Referring Expression Instance Retrieval and A Strong End-to-End Baseline
- URL: http://arxiv.org/abs/2506.18246v3
- Date: Thu, 26 Jun 2025 11:31:17 GMT
- Title: Referring Expression Instance Retrieval and A Strong End-to-End Baseline
- Authors: Xiangzhao Hao, Kuan Zhu, Hongyu Guo, Haiyun Guo, Ning Jiang, Quan Lu, Ming Tang, Jinqiao Wang
- Abstract summary: Text-Image Retrieval retrieves a target image from a gallery based on an image-level description. Referring Expression Comprehension localizes a target object within a given image using an instance-level description. We introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization.
- Score: 37.47466772169063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query with an instance-level description across a large gallery and expect to receive both the relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we propose a large-scale benchmark for REIR, named REIRCOCO, constructed by prompting advanced vision-language models to generate high-quality referring expressions for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline method, Contrastive Language-Instance Alignment with Relation Experts (CLARE), which employs a dual-stream architecture to address REIR in an end-to-end manner. Given a referring expression, the textual branch encodes it into a query embedding. The visual branch detects candidate objects and extracts their instance-level visual features. The most similar candidate to the query is selected for bounding box prediction. CLARE is first trained on object detection and REC datasets to establish initial grounding capabilities, then optimized via Contrastive Language-Instance Alignment (CLIA) for improved retrieval across images. We will release our code and benchmark publicly.
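The abstract outlines CLARE's dual-stream design: the textual branch produces a query embedding, the visual branch produces per-instance embeddings and boxes, retrieval ranks gallery images by their best-matching instance, and CLIA aligns text and instance embeddings contrastively. The sketch below illustrates that scoring-and-training logic in PyTorch; the function names, tensor shapes, and temperature value are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch only: names, shapes, and the temperature are assumptions,
# not the authors' released implementation.
import torch
import torch.nn.functional as F

def rank_gallery(query_emb, instance_embs, instance_boxes):
    """Rank gallery images by their best-matching instance.

    query_emb:      (d,) text embedding of one referring expression
    instance_embs:  list of (n_i, d) instance embeddings, one tensor per image
    instance_boxes: list of (n_i, 4) boxes aligned with instance_embs
    Returns image indices sorted by score and the top box of each ranked image.
    """
    q = F.normalize(query_emb, dim=-1)
    image_scores, best_boxes = [], []
    for embs, boxes in zip(instance_embs, instance_boxes):
        sims = F.normalize(embs, dim=-1) @ q        # (n_i,) cosine similarities
        best = sims.argmax()
        image_scores.append(sims[best])             # image score = best instance score
        best_boxes.append(boxes[best])              # localization comes with retrieval
    order = torch.stack(image_scores).argsort(descending=True).tolist()
    return order, [best_boxes[i] for i in order]

def clia_loss(text_embs, positive_instance_embs, temperature=0.07):
    """InfoNCE-style contrastive language-instance alignment over a batch.

    text_embs:              (B, d) referring-expression embeddings
    positive_instance_embs: (B, d) embedding of the matching instance per text
    All other instances in the batch serve as negatives.
    """
    t = F.normalize(text_embs, dim=-1)
    v = F.normalize(positive_instance_embs, dim=-1)
    logits = t @ v.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(t.size(0))
    # symmetric loss: text-to-instance and instance-to-text
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```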
Related papers
- Composed Object Retrieval: Object-level Retrieval via Composed Expressions [71.47650333199628]
Composed Object Retrieval (COR) is a brand-new task that goes beyond image-level retrieval to achieve object-level precision. We construct COR127K, the first large-scale COR benchmark, which contains 127,166 retrieval triplets with various semantic transformations across 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning.
arXiv Detail & Related papers (2025-08-06T13:11:40Z) - Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval [13.296362770269452]
Mask-aware TIR (MaTIR) aims to find relevant images based on a textual query. We propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding. We evaluate our approach on the COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
arXiv Detail & Related papers (2025-06-28T12:19:49Z) - SORCE: Small Object Retrieval in Complex Environments [19.406708973678608]
We introduce SORCE (Small Object Retrieval in Complex Environments), a new subfield of T2IR, focusing on retrieving small objects in complex images with textual queries. Preliminary analysis on SORCE-1K finds that existing T2IR methods struggle to capture small objects and encode all the semantics into a single embedding. We leverage Multimodal Large Language Models (MLLMs) to extract multiple embeddings for each image, instructed by a set of Regional Prompts (ReP).
arXiv Detail & Related papers (2025-05-30T10:23:05Z) - QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z) - RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding. We introduce RSRefSeg, a foundation model for referring remote sensing image segmentation. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the challenges of this task, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval [26.101116761577796]
Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption.
We introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset.
We present an open-domain benchmarking dataset named CIRCO, where each query is labeled with multiple ground truths and a semantic categorization.
arXiv Detail & Related papers (2024-05-05T14:39:06Z) - Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress by using advanced large vision-language (VL) models for the CIR task; however, they generally suffer from two main issues: the lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps the need for training data by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Referring Expression Comprehension Using Language Adaptive Inference [15.09309604460633]
This paper explores the adaptation between expressions and REC models for dynamic inference.
We propose a framework named Language Adaptive Subnets (LADS), which can extract language-adaptive subnets from the REC model conditioned on the referring expressions.
Experiments on RefCOCO, RefCOCO+, RefCOCOg, and Referit show that the proposed method achieves faster inference speed and higher accuracy than state-of-the-art approaches.
arXiv Detail & Related papers (2023-06-06T07:58:59Z) - Zero-Shot Composed Image Retrieval with Textual Inversion [28.513594970580396]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption.
We propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset.
arXiv Detail & Related papers (2023-03-27T14:31:25Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks.
Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder.
We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z)