Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
- URL: http://arxiv.org/abs/2506.22864v1
- Date: Sat, 28 Jun 2025 12:19:49 GMT
- Title: Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
- Authors: Li-Cheng Shen, Jih-Kang Hsieh, Wei-Hua Li, Chu-Song Chen
- Abstract summary: Mask-aware TIR (MaTIR) aims to find relevant images based on a textual query. We propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
- Score: 13.296362770269452
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). In the first stage, we leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline, enabling effective and scalable online retrieval. In the second stage, an MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
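As a rough illustration of the two-stage retrieval logic described in the abstract, the sketch below shows how offline region-level embeddings (e.g., from Alpha-CLIP over SAM 2 masks) could drive a first-stage image search, followed by an MLLM-based reranking step. This is a minimal sketch under assumptions, not the authors' implementation: the mask and embedding extraction is assumed to be precomputed, and `rerank_with_mllm` is a hypothetical callback standing in for the MLLM reranker and box generator.

```python
# Minimal sketch of the two-stage MaTIR-style retrieval idea (assumptions noted above).
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Unit-normalize embeddings so dot products equal cosine similarity."""
    return x / np.clip(np.linalg.norm(x, axis=axis, keepdims=True), 1e-12, None)

def first_stage_retrieval(query_emb, region_embs, region_to_image, num_images, top_k=10):
    """Stage 1: score each image by its best-matching region embedding.

    query_emb:       (d,)   text embedding of the query
    region_embs:     (R, d) offline region-level embeddings (e.g., Alpha-CLIP over SAM 2 masks)
    region_to_image: (R,)   index of the image each region/mask belongs to
    """
    sims = l2_normalize(region_embs) @ l2_normalize(query_emb)  # (R,) region-query similarities
    image_scores = np.full(num_images, -np.inf)
    np.maximum.at(image_scores, region_to_image, sims)           # max-pool regions per image
    return np.argsort(-image_scores)[:top_k]                     # top candidate image ids

def second_stage_rerank(candidates, query_text, rerank_with_mllm):
    """Stage 2: let an MLLM rescore candidates and propose a bounding box per image.

    `rerank_with_mllm(image_id, query_text)` is a hypothetical callback returning
    (score, bounding_box); the box would then be matched to the closest SAM 2 mask
    (e.g., by IoU) to produce the final segmentation.
    """
    scored = [(img, *rerank_with_mllm(img, query_text)) for img in candidates]
    return sorted(scored, key=lambda t: -t[1])                   # (image_id, score, box) tuples

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    region_embs = rng.normal(size=(500, 512))        # e.g., 500 masks over 100 images
    region_to_image = rng.integers(0, 100, size=500)
    query_emb = rng.normal(size=512)
    print("stage-1 candidates:",
          first_stage_retrieval(query_emb, region_embs, region_to_image, num_images=100))
```

The key design point reflected here is that the expensive mask generation and region embedding happen once offline, so the online query cost reduces to a dot product over precomputed embeddings plus MLLM reranking of a small candidate set.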
Related papers
- Referring Expression Instance Retrieval and A Strong End-to-End Baseline [37.47466772169063]
Text-Image Retrieval retrieves a target image from a gallery based on an image-level description. Referring Expression localizes a target object within a given image using an instance-level description. We introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization.
arXiv Detail & Related papers (2025-06-23T02:28:44Z) - Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation [12.67400143793047]
We propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM). PSLG-SAM decomposes the Reference Remote Sensing Image Segmentation (RRSIS) task into two stages: coarse localization and fine segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task.
arXiv Detail & Related papers (2025-06-12T09:04:07Z) - MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens. We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z) - LlamaSeg: Image Segmentation via Autoregressive Mask Generation [46.17509085054758]
We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs.
arXiv Detail & Related papers (2025-05-26T02:22:41Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models [2.331828779757202]
We present CALICO, the first Large Vision-Language Model (LVLM) designed for multi-image part-level reasoning segmentation. CALICO features two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Adaptation Correspondence Modules that embed this information into the LVLM. We show that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
arXiv Detail & Related papers (2024-12-26T18:59:37Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z) - StrucTexTv2: Masked Visual-Textual Prediction for Document Image
Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z) - Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)