Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
- URL: http://arxiv.org/abs/2310.11612v1
- Date: Tue, 17 Oct 2023 22:10:17 GMT
- Title: Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
- Authors: Yimu Wang, Xiangru Jian, Bo Xue
- Abstract summary: Hubness is a phenomenon where a small number of gallery data points are frequently retrieved, resulting in a decline in retrieval performance.
We show the necessity of incorporating both the gallery and query data to address hubness, as hubs always exhibit high similarity with both gallery and query data.
We present extensive experimental results on diverse language-grounded benchmarks, including text-image, text-video, and text-audio.
- Score: 5.164924773752648
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this work, we present a post-processing solution to address the hubness
problem in cross-modal retrieval, a phenomenon where a small number of gallery
data points are frequently retrieved, resulting in a decline in retrieval
performance. We first theoretically demonstrate the necessity of incorporating
both the gallery and query data to address hubness, as hubs always exhibit
high similarity with both gallery and query data. Second, building on our
theoretical results, we propose a novel framework, Dual Bank Normalization
(DBNorm). While previous work has attempted to alleviate hubness by only
utilizing the query samples, DBNorm leverages two banks constructed from the
query and gallery samples to reduce the occurrence of hubs during inference.
Next, to complement DBNorm, we introduce two novel methods, dual inverted
softmax and dual dynamic inverted softmax, for normalizing similarity based on
the two banks. Specifically, our proposed methods reduce the similarity between
hubs and queries while improving the similarity between non-hubs and queries.
Finally, we present extensive experimental results on diverse language-grounded
benchmarks, including text-image, text-video, and text-audio, demonstrating the
superior performance of our approaches compared to previous methods in
addressing hubness and boosting retrieval performance. Our code is available at
https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
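The dual-bank normalization described above can be illustrated with a minimal sketch. The function below assumes a simple formulation (names, the combination of the two normalizers, and the temperature `beta` are illustrative, not taken from the paper): each raw query-gallery similarity is divided by inverted-softmax sums computed from a query bank and a gallery bank, so a hub that is similar to many bank samples accumulates a large denominator and is demoted. The paper's exact dual (dynamic) inverted softmax may differ in detail.

```python
import numpy as np

def dual_inverted_softmax(sim, qbank_sim, gbank_sim, beta=10.0):
    """Normalize query-gallery similarities with a query bank and a gallery bank.

    sim:       (num_queries, num_gallery) raw query-gallery similarities
    qbank_sim: (qbank_size,  num_gallery) bank-query vs. gallery similarities
    gbank_sim: (gbank_size,  num_gallery) bank-gallery vs. gallery similarities
    """
    # Per-gallery-item normalizers: a hub is highly similar to many
    # bank queries AND bank gallery samples, so both sums are large for it.
    q_norm = np.exp(beta * qbank_sim).sum(axis=0)   # (num_gallery,)
    g_norm = np.exp(beta * gbank_sim).sum(axis=0)   # (num_gallery,)
    # Dividing by the combined normalizer lowers hub scores relative
    # to non-hubs while leaving non-hub scores largely intact.
    return np.exp(beta * sim) / (q_norm + g_norm)

# Toy example: gallery item 0 is a hub (similar to everything in both banks).
sim = np.array([[0.9, 0.8]])                       # query prefers the hub raw
qbank_sim = np.array([[0.95, 0.1], [0.9, 0.2]])
gbank_sim = np.array([[0.92, 0.15]])
normalized = dual_inverted_softmax(sim, qbank_sim, gbank_sim)
```

After normalization, the non-hub item 1 outranks the hub item 0 even though its raw similarity was lower, which is the intended effect of penalizing hubs.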
Related papers
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, providing a potential solution to these problems.
We propose a novel hierarchical feature refinement network for event-frame fusion.
Our method exhibits significantly better robustness when introducing 15 different corruption types to the frame images.
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
- Direct Diffusion Bridge using Data Consistency for Inverse Problems [65.04689839117692]
Diffusion model-based inverse problem solvers have shown impressive performance, but are limited in speed.
Several recent works have tried to alleviate this problem by building a diffusion process, directly bridging the clean and the corrupted.
We propose a modified inference procedure that imposes data consistency without the need for fine-tuning.
arXiv Detail & Related papers (2023-05-31T12:51:10Z)
- Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
arXiv Detail & Related papers (2023-04-25T04:33:27Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Cross Modal Retrieval with Querybank Normalisation [41.877255953069074]
We show that state-of-the-art joint embeddings suffer from the longstanding hubness problem.
We formulate a simple but effective framework that re-normalises query similarities to account for hubs in the embedding space.
We show that QB-Norm works effectively without concurrent access to any test set queries.
arXiv Detail & Related papers (2021-12-23T18:51:58Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z) - BSN++: Complementary Boundary Regressor with Scale-Balanced Relation
Modeling for Temporal Action Proposal Generation [85.13713217986738]
We present BSN++, a new framework which exploits complementary boundary regressor and relation modeling for temporal proposal generation.
Not surprisingly, the proposed BSN++ ranked 1st place in the CVPR19 - ActivityNet challenge leaderboard on temporal action localization task.
arXiv Detail & Related papers (2020-09-15T07:08:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.