FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text
- URL: http://arxiv.org/abs/2511.22247v1
- Date: Thu, 27 Nov 2025 09:18:56 GMT
- Title: FIGROTD: A Friendly-to-Handle Dataset for Image Guided Retrieval with Optional Text
- Authors: Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, Cathal Gurrin
- Abstract summary: Image-Guided Retrieval with Optional Text (IGROT) unifies visual retrieval (without text) and composed retrieval (with text). We introduce FIGROTD, a lightweight yet high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets. Trained on FIGROTD, VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy.
- Score: 3.6723140587841656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-Guided Retrieval with Optional Text (IGROT) unifies visual retrieval (without text) and composed retrieval (with text). Despite its relevance in applications like Google Images and Bing, progress has been limited by the lack of an accessible benchmark and of methods that balance performance across subtasks. Large-scale datasets such as MagicLens are comprehensive but computationally prohibitive, while existing models often favor either visual or compositional queries. We introduce FIGROTD, a lightweight yet high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets across CIR, SBIR, and CSTBIR. To reduce redundancy, we propose the Variance Guided Feature Mask (VaGFeM), which selectively enhances discriminative embedding dimensions based on variance statistics. We further adopt a dual-loss design (InfoNCE + Triplet) to improve compositional reasoning. Trained on FIGROTD, VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy, outperforming stronger baselines despite using fewer training triplets.
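The abstract names the two mechanisms but gives no implementation details, so the snippet below is only a minimal, speculative PyTorch-style reading of them: a mask derived from per-dimension variance statistics that up-weights high-variance (presumably discriminative) embedding dimensions, and a combined InfoNCE + triplet objective over in-batch negatives. The function names, the masking rule, the 1:1 loss weighting, the temperature, and the margin are all illustrative assumptions, not the authors' VaGFeM implementation.

```python
import torch
import torch.nn.functional as F


def variance_guided_mask(embeddings: torch.Tensor, top_ratio: float = 0.5) -> torch.Tensor:
    """Guess at a variance-guided feature mask: up-weight high-variance dimensions."""
    var = embeddings.var(dim=0)                        # per-dimension variance, shape (d,)
    k = max(1, int(top_ratio * var.numel()))
    top_idx = var.topk(k).indices                      # indices of the most variable dims
    mask = torch.ones_like(var)
    mask[top_idx] = 2.0                                # emphasise assumed-discriminative dims
    return mask


def dual_loss(query: torch.Tensor, target: torch.Tensor,
              temperature: float = 0.07, margin: float = 0.2) -> torch.Tensor:
    """Combined InfoNCE + triplet loss with in-batch negatives (illustrative 1:1 weighting)."""
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)
    sim = q @ t.T                                      # (B, B) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    info_nce = F.cross_entropy(sim / temperature, labels)

    pos = sim.diagonal()                               # similarity of matched query/target pairs
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(eye, float("-inf")).max(dim=-1).values
    triplet = F.relu(margin - pos + hardest_neg).mean()
    return info_nce + triplet


# Usage sketch: random embeddings stand in for (image [+ optional text]) query and
# target features; the mask is applied to both sides before computing the loss.
query, target = torch.randn(32, 512), torch.randn(32, 512)
mask = variance_guided_mask(target)
loss = dual_loss(query * mask, target * mask)
```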
Related papers
- UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries [3.6723140587841656]
Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting where a query consists of an anchor image, with or without accompanying text, aiming to retrieve semantically relevant target images. In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt.
arXiv Detail & Related papers (2025-11-27T09:28:28Z) - Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models. We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z) - VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval [56.12310817934239]
Cross-modal embeddings behave as bags of concepts and underrepresent structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that mitigates this limitation of cross-modal similarity alignment. VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching.
arXiv Detail & Related papers (2025-05-26T17:59:33Z) - Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval [52.709090256954276]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query. We propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR.
arXiv Detail & Related papers (2025-05-26T13:17:50Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of (reference image, text, target image).
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z) - Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z) - A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation [3.364554138758565]
Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query.
We introduce a Relative Triplet Loss (RTL), an adapted triplet loss that addresses limitations of the standard triplet loss by weighting each triplet's loss according to anchor similarity (a speculative sketch of this weighting idea appears after this list).
We propose a straightforward approach to train small models efficiently with a marginal loss of accuracy through knowledge distillation.
arXiv Detail & Related papers (2023-05-30T12:41:04Z)
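The RTL entry above only states that the triplet loss is re-weighted using anchor similarity, so the following is a speculative sketch of that idea, not the authors' RTL: each triplet's standard hinge term is scaled by a weight derived from its anchor-positive similarity relative to the rest of the batch. The function name, the softmax-based weighting, and the margin are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def relative_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                          negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Speculative triplet loss whose per-triplet weight depends on anchor-positive similarity."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)

    pos_sim = (a * p).sum(dim=-1)                  # anchor-positive cosine similarity
    neg_sim = (a * n).sum(dim=-1)                  # anchor-negative cosine similarity
    base = F.relu(margin - pos_sim + neg_sim)      # standard hinge triplet term

    # Assumed weighting: harder triplets (low anchor-positive similarity) get larger
    # weights; weights are normalised so they average to 1 over the batch.
    weights = torch.softmax(1.0 - pos_sim, dim=0) * pos_sim.numel()
    return (weights * base).mean()


# Usage sketch with random embeddings standing in for sketch/photo features.
anchor, positive, negative = (torch.randn(16, 256) for _ in range(3))
loss = relative_triplet_loss(anchor, positive, negative)
```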