Candidate Set Re-ranking for Composed Image Retrieval with Dual
Multi-modal Encoder
- URL: http://arxiv.org/abs/2305.16304v3
- Date: Mon, 29 Jan 2024 05:03:55 GMT
- Title: Candidate Set Re-ranking for Composed Image Retrieval with Dual
Multi-modal Encoder
- Authors: Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
- Abstract summary: Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair.
Existing methods pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time.
We propose to combine the merits of both schemes using a two-stage model.
- Score: 45.60134971181856
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Composed image retrieval aims to find an image that best matches a given
multi-modal user query consisting of a reference image and text pair. Existing
methods commonly pre-compute image embeddings over the entire corpus and
compare these to a reference image embedding modified by the query text at test
time. Such a pipeline is very efficient at test time since fast vector
distances can be used to evaluate candidates, but modifying the reference image
embedding guided only by a short textual description can be difficult,
especially independent of potential candidates. An alternative approach is to
allow interactions between the query and every possible candidate, i.e.,
reference-text-candidate triplets, and pick the best from the entire set.
Though this approach is more discriminative, for large-scale datasets the
computational cost is prohibitive since pre-computation of candidate embeddings
is no longer possible. We propose to combine the merits of both schemes using a
two-stage model. Our first stage adopts the conventional vector distancing
metric and performs a fast pruning among candidates. Meanwhile, our second
stage employs a dual-encoder architecture, which effectively attends to the
input triplet of reference-text-candidate and re-ranks the candidates. Both
stages utilize a vision-and-language pre-trained network, which has proven
beneficial for various downstream tasks. Our method consistently outperforms
state-of-the-art approaches on standard benchmarks for the task. Our
implementation is available at
https://github.com/Cuberick-Orion/Candidate-Reranking-CIR.
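The two-stage scheme described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: `rerank_score_fn` is a hypothetical placeholder standing in for the paper's dual-encoder scorer over reference-text-candidate triplets.

```python
import numpy as np

def two_stage_retrieval(query_emb, corpus_embs, rerank_score_fn, k=50):
    """Sketch of a retrieve-then-rerank pipeline.

    Stage 1: prune the corpus with fast vector distances against
    pre-computed candidate embeddings.
    Stage 2: re-score only the surviving top-k candidates with an
    expensive scorer that attends to the full query-candidate input.
    """
    # Stage 1: cosine similarity between the composed query embedding
    # and every pre-computed corpus embedding.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    top_k = np.argsort(-sims)[:k]  # fast pruning to a candidate set

    # Stage 2: expensive re-ranking restricted to the candidate set.
    rerank_scores = np.array([rerank_score_fn(i) for i in top_k])
    return top_k[np.argsort(-rerank_scores)]
```

The key property is that the expensive scorer runs only k times per query, independent of corpus size, so candidate embeddings no longer need to be pre-computable for the second stage.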
Related papers
- DeepClean: Integrated Distortion Identification and Algorithm Selection for Rectifying Image Corruptions [1.8024397171920883]
We propose a two-level sequential planning approach for automated image distortion classification and rectification.
The advantage of our approach is its dynamic reconfiguration, conditioned on the input image and generalisability to unseen candidate algorithms at inference time.
arXiv Detail & Related papers (2024-07-23T08:57:11Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
- Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval is the task of finding images related to a given query text, or vice versa.
Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- Multi-scale 2D Representation Learning for weakly-supervised moment retrieval [18.940164141627914]
We propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval.
Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates.
We select top-K candidates from each scale-varied map with a learnable convolutional neural network.
arXiv Detail & Related papers (2021-11-04T10:48:37Z)
- Exploring Dense Retrieval for Dialogue Response Selection [42.89426092886912]
We present a solution to directly select proper responses from a large corpus or even a nonparallel corpus, using a dense retrieval model.
In the re-rank setting, its superiority is surprising given the model's simplicity. In the full-rank setting, we are, to our knowledge, the first to perform such an evaluation.
arXiv Detail & Related papers (2021-10-13T10:10:32Z)
- UniCon: Unified Context Network for Robust Active Speaker Detection [111.90529347692723]
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD).
Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information.
A thorough ablation study is performed on several challenging ASD benchmarks under different settings.
arXiv Detail & Related papers (2021-08-05T13:25:44Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
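The hinge-based triplet ranking objective mentioned in the last entry can be sketched as follows. This is the standard formulation of the loss, not that paper's exact code, and the margin value is illustrative.

```python
def triplet_hinge_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-based triplet ranking loss for one (match, mismatch) pair.

    The loss is zero once the matching pair scores at least `margin`
    higher than the mismatched pair, and grows linearly otherwise,
    pushing positives above negatives in the ranking.
    """
    return max(0.0, margin - sim_pos + sim_neg)
```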
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.