ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections
- URL: http://arxiv.org/abs/2506.15180v1
- Date: Wed, 18 Jun 2025 06:52:10 GMT
- Title: ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections
- Authors: Ziling Huang, Yidan Zhang, Shin'ichi Satoh
- Abstract summary: We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
- Score: 14.076781094343362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
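The decide-then-localize pipeline and the joint metric described in the abstract can be sketched in a few lines. The following is an illustrative Python sketch, not the authors' baseline: the paper's actual zero-shot method uses a frozen vision-language model whose interface is not given here, so `score_fn` and `ground_fn` are hypothetical stand-ins, and `joint_score` is one plausible way to couple retrieval recall with localization precision, not the paper's exact metric definition.

```python
def resedis_search(query, corpus, score_fn, ground_fn, threshold=0.5):
    """For each image, decide whether the queried object is present
    (corpus-level retrieval) and, only if it is, localize it
    (pixel-level grounding). Skipping low-scoring images avoids the
    false alarms produced by grounding-only methods, which assume the
    object appears in every image."""
    results = {}
    for image_id, image in corpus.items():
        if score_fn(image, query) >= threshold:          # retrieval decision
            results[image_id] = ground_fn(image, query)  # box/mask grounding
    return results


def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def joint_score(preds, gts, iou_thresh=0.5):
    """One plausible joint metric (an assumption, not the paper's
    definition): a ground-truth instance counts as found only if its
    image was retrieved AND the predicted box overlaps the true box by
    at least `iou_thresh`; the score is recall over all instances."""
    hits = sum(1 for img, box in gts.items()
               if img in preds and box_iou(preds[img], box) >= iou_thresh)
    return hits / len(gts) if gts else 0.0
```

In a toy run with `score_fn = lambda image, query: image` over a corpus of raw scores `{"img1": 0.9, "img2": 0.1}`, only `img1` clears the threshold and gets a box, so a grounding step is never charged for the irrelevant image. Note how the metric penalizes both missed retrievals and sloppy boxes with a single number.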
Related papers
- Composed Object Retrieval: Object-level Retrieval via Composed Expressions [71.47650333199628]
Composed Object Retrieval (COR) is a brand-new task that goes beyond image-level retrieval to achieve object-level precision. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning.
arXiv Detail & Related papers (2025-08-06T13:11:40Z) - In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding [7.272083488859574]
We introduce a new dataset for benchmarking visual search methods on flat images with diverse patterns.
Our flat object retrieval benchmark (FORB) supplements the commonly adopted 3D object domain.
It serves as a testbed for assessing the image embedding quality on out-of-distribution domains.
arXiv Detail & Related papers (2023-09-28T08:41:51Z) - Image Segmentation-based Unsupervised Multiple Objects Discovery [1.7674345486888503]
Unsupervised object discovery aims to localize objects in images.
We propose a fully unsupervised, bottom-up approach to multiple-object discovery.
We provide state-of-the-art results for both unsupervised class-agnostic object detection and unsupervised image segmentation.
arXiv Detail & Related papers (2022-12-20T09:48:24Z) - Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: Object-agnostic pre-processing, manual image selection and CNN-based image selection.
arXiv Detail & Related papers (2022-10-18T12:49:04Z) - Few-shot Object Counting and Detection [25.61294147822642]
We tackle a new task of few-shot object counting and detection. Given a few exemplar bounding boxes of a target object class, we seek to count and detect all objects of the target class.
This task shares the same supervision as the few-shot object counting but additionally outputs the object bounding boxes along with the total object count.
We introduce a novel two-stage training strategy and a novel uncertainty-aware few-shot object detector: Counting-DETR.
arXiv Detail & Related papers (2022-07-22T10:09:18Z) - Learning to Detect Every Thing in an Open World [139.78830329914135]
We propose a simple yet surprisingly powerful data augmentation and training scheme we call Learning to Detect Every Thing (LDET).
To avoid suppressing hidden objects (background objects that are visible but unlabeled), we paste annotated objects onto a background image sampled from a small region of the original image.
LDET leads to significant improvements on many datasets in the open world instance segmentation task.
arXiv Detail & Related papers (2021-12-03T03:56:06Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - Addressing Visual Search in Open and Closed Set Settings [8.928169373673777]
We present a method for predicting pixel-level objectness from a low-resolution gist image.
We then use this prediction to select regions for performing object detection locally at high resolution.
Second, we propose a novel strategy for open-set visual search that seeks to find all instances of a target class which may be previously unseen.
arXiv Detail & Related papers (2020-12-11T17:21:28Z) - Tasks Integrated Networks: Joint Detection and Retrieval for Image Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z) - Compact Deep Aggregation for Set Retrieval [87.52470995031997]
We focus on retrieving images containing multiple faces from a large scale dataset of images.
Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities.
We show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that.
arXiv Detail & Related papers (2020-03-26T08:43:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.