Related papers: Learnable Pillar-based Re-ranking for Image-Text Retrieval

Learnable Pillar-based Re-ranking for Image-Text Retrieval

URL: http://arxiv.org/abs/2304.12570v1
Date: Tue, 25 Apr 2023 04:33:27 GMT
Title: Learnable Pillar-based Re-ranking for Image-Text Retrieval
Authors: Leigang Qu, Meng Liu, Wenjie Wang, Zhedong Zheng, Liqiang Nie, Tat-Seng Chua
Abstract summary: Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks. We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
Score: 119.9979224297237
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor relations (i.e., a matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks. However, it is ineffective to directly extend existing re-ranking algorithms to image-text retrieval. In this paper, we analyze the reason from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and inter-modal neighbors as pillars, and then reconstruct data samples with the neighbor relations between them and the pillars. In this way, each sample can be mapped into a multimodal pillar space only using similarities, ensuring generalization. After that, we design a neighbor-aware graph reasoning module to flexibly exploit the relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote cross-modal collaboration and align the asymmetric modalities. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of our proposed re-ranking paradigm.

Related papers

Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts [67.67746334493302]
Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. We propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP) We show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies.
arXiv Detail & Related papers (2025-04-15T17:35:56Z)
SEG:Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment [13.487673375206276]
This paper presents a soft label propagation framework that integrates multi-source data and iterative seed enhancement. A bidirectional weighted joint loss function is implemented, which reduces the distance between positive samples and differentially processes negative samples. Our method outperforms existing semi-supervised approaches, as evidenced by superior results on multiple datasets.
arXiv Detail & Related papers (2024-10-28T04:50:46Z)
Multimodal Relational Triple Extraction with Query-based Entity Object Transformer [20.97497765985682]
Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge. We propose Multimodal Entity-Object Triple Extraction, which aims to extract all triples (entity, relation, object region) from image-text pairs. We also propose QEOT, a query-based model with a selective attention mechanism to dynamically explore the interaction and fusion of textual and visual information.
arXiv Detail & Related papers (2024-08-16T12:43:38Z)
Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching. An anchor branch is first trained to provide insights into the data properties. A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking [31.15972952813689]
We propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2023-10-09T10:21:42Z)
Hierarchical Matching and Reasoning for Multi-Query Image Retrieval [113.44470784756308]
We propose a novel Hierarchical Matching and Reasoning Network (HMRN) for Multi-Query Image Retrieval (MQIR) It disentangles MQIR into three hierarchical semantic representations, which is responsible to capture fine-grained local details, contextual global scopes, and high-level inherent correlations. Our HMRN substantially surpasses the current state-of-the-art methods.
arXiv Detail & Related papers (2023-06-26T07:03:56Z)
Adaptive Similarity Bootstrapping for Self-Distillation based Representation Learning [40.94237853380154]
NNCLR goes beyond the cross-view paradigm and uses positive pairs from different images obtained via nearest neighbor bootstrapping in a contrastive setting. We empirically show that as opposed to the contrastive learning setting which relies on negative samples, incorporating nearest neighbor bootstrapping in a self-distillation scheme can lead to a performance drop or even collapse. We propose to adaptively bootstrap neighbors based on the estimated quality of the latent space.
arXiv Detail & Related papers (2023-03-23T18:40:17Z)
BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency [66.8685113725007]
BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree. experiments on three popular cross-modal matching datasets demonstrate that BiCro significantly improves the noise-robustness of various matching models.
arXiv Detail & Related papers (2023-03-22T09:33:50Z)
Cross-Domain Few-Shot Relation Extraction via Representation Learning and Domain Adaptation [1.1602089225841632]
Few-shot relation extraction aims to recognize novel relations with few labeled sentences in each relation. Previous metric-based few-shot relation extraction algorithms identify relationships by comparing the prototypes generated by the few labeled sentences embedding with the embeddings of the query sentences using a trained metric function. We suggest learning more interpretable and efficient prototypes from prior knowledge and the intrinsic semantics of relations to extract new relations in various domains more effectively.
arXiv Detail & Related papers (2022-12-05T19:34:52Z)
ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles. Our proposed method ReSel decomposes this task into a two-stage procedure. Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos. We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges. Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.