Learnable Pillar-based Re-ranking for Image-Text Retrieval
- URL: http://arxiv.org/abs/2304.12570v1
- Date: Tue, 25 Apr 2023 04:33:27 GMT
- Title: Learnable Pillar-based Re-ranking for Image-Text Retrieval
- Authors: Leigang Qu, Meng Liu, Wenjie Wang, Zhedong Zheng, Liqiang Nie, Tat-Seng Chua
- Abstract summary: Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
- Score: 119.9979224297237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval aims to bridge the modality gap and retrieve cross-modal
content based on semantic similarities. Prior work usually focuses on the
pairwise relations (i.e., whether a data sample matches another) but ignores
the higher-order neighbor relations (i.e., a matching structure among multiple
data samples). Re-ranking, a popular post-processing practice, has revealed the
superiority of capturing neighbor relations in single-modality retrieval tasks.
However, it is ineffective to directly extend existing re-ranking algorithms to
image-text retrieval. In this paper, we analyze the reason from four
perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and
propose a novel learnable pillar-based re-ranking paradigm. Concretely, we
first select top-ranked intra- and inter-modal neighbors as pillars, and then
reconstruct data samples with the neighbor relations between them and the
pillars. In this way, each sample can be mapped into a multimodal pillar space
only using similarities, ensuring generalization. After that, we design a
neighbor-aware graph reasoning module to flexibly exploit the relations and
excavate the sparse positive items within a neighborhood. We also present a
structure alignment constraint to promote cross-modal collaboration and align
the asymmetric modalities. On top of various base backbones, we carry out
extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO,
demonstrating the effectiveness, superiority, generalization, and
transferability of our proposed re-ranking paradigm.
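The pillar idea in the abstract (pick top-ranked intra- and inter-modal neighbors of a query, then describe every sample only by its similarities to those pillars) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy similarity layout (`q2t`, `q2i`, `t2t`, `t2i`), and the parameter `k` are hypothetical, and a plain cosine comparison in pillar space stands in for the paper's learnable neighbor-aware graph reasoning module and structure alignment constraint, which are not reproduced here.

```python
import math

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rerank_texts_for_image(q2t, q2i, t2t, t2i, k=2):
    """Re-rank all texts for one image query in a pillar space.

    Hypothetical inputs, all produced by some base retrieval model:
      q2t[j]    similarity of the query image to text j   (inter-modal)
      q2i[j]    similarity of the query image to image j  (intra-modal)
      t2t[a][b] similarity of text a to text b
      t2i[a][j] similarity of text a to image j
    """
    # Step 1: the query's top-ranked neighbors in each modality become pillars.
    text_pillars = top_k_indices(q2t, k)
    image_pillars = top_k_indices(q2i, k)

    # Step 2: encode the query using only its similarities to the pillars.
    q_code = [q2t[p] for p in text_pillars] + [q2i[p] for p in image_pillars]

    # Step 3: encode each candidate text against the same pillars and compare.
    # ASSUMPTION: cosine replaces the paper's learned graph reasoning module.
    candidates = top_k_indices(q2t, len(q2t))  # keep base order for ties
    scored = []
    for t in candidates:
        t_code = [t2t[t][p] for p in text_pillars] + [t2i[t][p] for p in image_pillars]
        scored.append((cosine(q_code, t_code), t))
    scored.sort(key=lambda s: s[0], reverse=True)  # stable sort
    return [t for _, t in scored]

# Toy example: 3 texts and 2 gallery images with made-up similarities.
q2t = [0.9, 0.8, 0.1]
q2i = [0.7, 0.2]
t2t = [[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]]
t2i = [[0.2, 0.1], [0.8, 0.3], [0.1, 0.9]]
print(rerank_texts_for_image(q2t, q2i, t2t, t2i, k=2))
```

Because the encoding uses only similarity values, the same routine could in principle sit on top of any base backbone that produces such scores, which is the generalization property the abstract emphasizes.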
Related papers
- SEG:Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment [13.487673375206276]
This paper presents a soft label propagation framework that integrates multi-source data and iterative seed enhancement.
A bidirectional weighted joint loss function is implemented, which reduces the distance between positive samples and differentially processes negative samples.
Our method outperforms existing semi-supervised approaches, as evidenced by superior results on multiple datasets.
arXiv Detail & Related papers (2024-10-28T04:50:46Z)
- Multimodal Relational Triple Extraction with Query-based Entity Object Transformer [20.97497765985682]
Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge.
We propose Multimodal Entity-Object Triple Extraction, which aims to extract all triples (entity, relation, object region) from image-text pairs.
We also propose QEOT, a query-based model with a selective attention mechanism to dynamically explore the interaction and fusion of textual and visual information.
arXiv Detail & Related papers (2024-08-16T12:43:38Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking [31.15972952813689]
We propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks.
DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples.
Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2023-10-09T10:21:42Z)
- Hierarchical Matching and Reasoning for Multi-Query Image Retrieval [113.44470784756308]
We propose a novel Hierarchical Matching and Reasoning Network (HMRN) for Multi-Query Image Retrieval (MQIR).
It disentangles MQIR into three hierarchical semantic representations, which are responsible for capturing fine-grained local details, contextual global scopes, and high-level inherent correlations.
Our HMRN substantially surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2023-06-26T07:03:56Z)
- Adaptive Similarity Bootstrapping for Self-Distillation based Representation Learning [40.94237853380154]
NNCLR goes beyond the cross-view paradigm and uses positive pairs from different images obtained via nearest neighbor bootstrapping in a contrastive setting.
We empirically show that as opposed to the contrastive learning setting which relies on negative samples, incorporating nearest neighbor bootstrapping in a self-distillation scheme can lead to a performance drop or even collapse.
We propose to adaptively bootstrap neighbors based on the estimated quality of the latent space.
arXiv Detail & Related papers (2023-03-23T18:40:17Z)
- BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency [66.8685113725007]
BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree.
Experiments on three popular cross-modal matching datasets demonstrate that BiCro significantly improves the noise-robustness of various matching models.
arXiv Detail & Related papers (2023-03-22T09:33:50Z)
- Cross-Domain Few-Shot Relation Extraction via Representation Learning and Domain Adaptation [1.1602089225841632]
Few-shot relation extraction aims to recognize novel relations with few labeled sentences in each relation.
Previous metric-based few-shot relation extraction algorithms identify relations by using a trained metric function to compare prototypes, generated from the embeddings of the few labeled sentences, with the embeddings of the query sentences.
We suggest learning more interpretable and efficient prototypes from prior knowledge and the intrinsic semantics of relations to extract new relations in various domains more effectively.
arXiv Detail & Related papers (2022-12-05T19:34:52Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.