Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
- URL: http://arxiv.org/abs/2310.11612v1
- Date: Tue, 17 Oct 2023 22:10:17 GMT
- Title: Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
- Authors: Yimu Wang, Xiangru Jian, Bo Xue
- Abstract summary: Hubness is a phenomenon where a small number of gallery data points are frequently retrieved, resulting in a decline in retrieval performance.
We show the necessity of incorporating both the gallery and query data to address hubness, as hubs always exhibit high similarity with both gallery and query data.
We present extensive experimental results on diverse language-grounded benchmarks, including text-image, text-video, and text-audio.
- Score: 5.164924773752648
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this work, we present a post-processing solution to address the hubness
problem in cross-modal retrieval, a phenomenon where a small number of gallery
data points are frequently retrieved, resulting in a decline in retrieval
performance. We first theoretically demonstrate the necessity of incorporating
both the gallery and query data to address hubness, as hubs always exhibit
high similarity with both gallery and query data. Second, building on our
theoretical results, we propose a novel framework, Dual Bank Normalization
(DBNorm). While previous work has attempted to alleviate hubness by only
utilizing the query samples, DBNorm leverages two banks constructed from the
query and gallery samples to reduce the occurrence of hubs during inference.
Next, to complement DBNorm, we introduce two novel methods, dual inverted
softmax and dual dynamic inverted softmax, for normalizing similarity based on
the two banks. Specifically, our proposed methods reduce the similarity between
hubs and queries while improving the similarity between non-hubs and queries.
Finally, we present extensive experimental results on diverse language-grounded
benchmarks, including text-image, text-video, and text-audio, demonstrating the
superior performance of our approaches compared to previous methods in
addressing hubness and boosting retrieval performance. Our code is available at
https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
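The dual-bank normalization described above can be illustrated with a minimal sketch. The function below assumes a simple formulation (names, the combination of the two normalizers, and the temperature `beta` are illustrative, not taken from the paper): each raw query-gallery similarity is divided by inverted-softmax sums computed from a query bank and a gallery bank, so a hub that is similar to many bank samples accumulates a large denominator and is demoted. The paper's exact dual (dynamic) inverted softmax may differ in detail.

```python
import numpy as np

def dual_inverted_softmax(sim, qbank_sim, gbank_sim, beta=10.0):
    """Normalize query-gallery similarities with a query bank and a gallery bank.

    sim:       (num_queries, num_gallery) raw query-gallery similarities
    qbank_sim: (qbank_size,  num_gallery) bank-query vs. gallery similarities
    gbank_sim: (gbank_size,  num_gallery) bank-gallery vs. gallery similarities
    """
    # Per-gallery-item normalizers: a hub is highly similar to many
    # bank queries AND bank gallery samples, so both sums are large for it.
    q_norm = np.exp(beta * qbank_sim).sum(axis=0)   # (num_gallery,)
    g_norm = np.exp(beta * gbank_sim).sum(axis=0)   # (num_gallery,)
    # Dividing by the combined normalizer lowers hub scores relative
    # to non-hubs while leaving non-hub scores largely intact.
    return np.exp(beta * sim) / (q_norm + g_norm)

# Toy example: gallery item 0 is a hub (similar to everything in both banks).
sim = np.array([[0.9, 0.8]])                       # query prefers the hub raw
qbank_sim = np.array([[0.95, 0.1], [0.9, 0.2]])
gbank_sim = np.array([[0.92, 0.15]])
normalized = dual_inverted_softmax(sim, qbank_sim, gbank_sim)
```

After normalization, the non-hub item 1 outranks the hub item 0 even though its raw similarity was lower, which is the intended effect of penalizing hubs.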
Related papers
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, providing a potential solution to these problems.
We propose a novel hierarchical feature refinement network for event-frame fusion.
Our method exhibits significantly better robustness when introducing 15 different corruption types to the frame images.
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
- Direct Diffusion Bridge using Data Consistency for Inverse Problems [65.04689839117692]
Diffusion model-based inverse problem solvers have shown impressive performance, but are limited in speed.
Several recent works have tried to alleviate this problem by building a diffusion process, directly bridging the clean and the corrupted.
We propose a modified inference procedure that imposes data consistency without the need for fine-tuning.
arXiv Detail & Related papers (2023-05-31T12:51:10Z)
- Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
arXiv Detail & Related papers (2023-04-25T04:33:27Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Cross Modal Retrieval with Querybank Normalisation [41.877255953069074]
We show that state-of-the-art joint embeddings suffer from the longstanding hubness problem.
We formulate a simple but effective framework that re-normalises query similarities to account for hubs in the embedding space.
We show that QB-Norm works effectively without concurrent access to any test set queries.
arXiv Detail & Related papers (2021-12-23T18:51:58Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z) - BSN++: Complementary Boundary Regressor with Scale-Balanced Relation
Modeling for Temporal Action Proposal Generation [85.13713217986738]
We present BSN++, a new framework which exploits complementary boundary regressor and relation modeling for temporal proposal generation.
Not surprisingly, the proposed BSN++ ranked 1st place in the CVPR19 - ActivityNet challenge leaderboard on temporal action localization task.
arXiv Detail & Related papers (2020-09-15T07:08:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.