Plug-and-Play Regulators for Image-Text Matching
- URL: http://arxiv.org/abs/2303.13371v1
- Date: Thu, 23 Mar 2023 15:42:05 GMT
- Title: Plug-and-Play Regulators for Image-Text Matching
- Authors: Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, Huchuan Lu
- Abstract summary: Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
- Score: 76.28522712930668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploiting fine-grained correspondence and visual-semantic alignments has
shown great potential in image-text matching. Generally, recent approaches
first employ a cross-modal attention unit to capture latent region-word
interactions, and then integrate all the alignments to obtain the final
similarity. However, most of them adopt one-time forward association or
aggregation strategies with complex architectures or additional information,
while ignoring the regulation ability of network feedback. In this paper, we
develop two simple but quite effective regulators which efficiently encode the
message output to automatically contextualize and aggregate cross-modal
representations. Specifically, we propose (i) a Recurrent Correspondence
Regulator (RCR) which facilitates the cross-modal attention unit progressively
with adaptive attention factors to capture more flexible correspondence, and
(ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation
weights repeatedly to increasingly emphasize important alignments and dilute
unimportant ones. Besides, it is interesting that RCR and RAR are
plug-and-play: both of them can be incorporated into many frameworks based on
cross-modal interaction to obtain significant benefits, and their cooperation
achieves further improvements. Extensive experiments on MSCOCO and Flickr30K
datasets validate that they can bring an impressive and consistent R@1 gain on
multiple models, confirming the general effectiveness and generalization
ability of the proposed methods. Code and pre-trained models are available at:
https://github.com/Paranioar/RCAR.
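For intuition, below is a minimal PyTorch sketch of the recurrent regulation idea described in the abstract: RCR feeds the attention output back to adapt the attention factors, and RAR feeds the current aggregation weights back to re-weight the alignments before pooling. The module names, dimensions, and exact update rules here are illustrative assumptions rather than the authors' implementation; the official code is at https://github.com/Paranioar/RCAR.

```python
# Hedged sketch of the RCR/RAR regulation loop; update rules are assumptions,
# not the reference implementation (see https://github.com/Paranioar/RCAR).
import torch
import torch.nn as nn


class RecurrentCorrespondenceRegulator(nn.Module):
    """Refines a per-region attention factor from the previous attention output."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_factor = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1)
        )

    def forward(self, regions, words, factor, steps: int = 2):
        # regions: (B, R, D), words: (B, W, D), factor: (B, R, 1) attention factor
        for _ in range(steps):
            scores = factor * (regions @ words.transpose(1, 2))   # (B, R, W)
            attn = torch.softmax(scores, dim=-1)
            attended = attn @ words                               # (B, R, D)
            # Feedback: use the attended context to adapt the attention factor.
            factor = 1.0 + torch.sigmoid(
                self.to_factor(torch.cat([regions, attended], dim=-1))
            )
        return attended, factor


class RecurrentAggregationRegulator(nn.Module):
    """Repeatedly re-weights alignments, emphasizing important ones before pooling."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_weight = nn.Sequential(
            nn.Linear(dim + 1, dim), nn.Tanh(), nn.Linear(dim, 1)
        )

    def forward(self, alignments, steps: int = 3):
        # alignments: (B, N, D) alignment vectors, one per region-word pair
        b, n, _ = alignments.shape
        weights = alignments.new_full((b, n, 1), 1.0 / n)  # start uniform
        for _ in range(steps):
            # Feedback: refine weights from the alignments and current weights.
            logits = self.to_weight(torch.cat([alignments, weights], dim=-1))
            weights = torch.softmax(logits, dim=1)
        return (weights * alignments).sum(dim=1)  # (B, D) pooled similarity feature


# Usage with random features: 36 image regions, 12 caption words, dimension 256.
rcr = RecurrentCorrespondenceRegulator(256)
rar = RecurrentAggregationRegulator(256)
attended, factor = rcr(torch.randn(4, 36, 256), torch.randn(4, 12, 256),
                       factor=torch.ones(4, 36, 1))
pooled = rar(attended)
print(pooled.shape)  # torch.Size([4, 256])
```

Because both regulators only consume and refine quantities that any cross-modal interaction framework already produces (attention factors and aggregation weights), they can be dropped into existing pipelines, which is the plug-and-play property the abstract emphasizes.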
Related papers
- Learning Partially Aligned Item Representation for Cross-Domain Sequential Recommendation [72.73379646418435]
Cross-domain sequential recommendation aims to uncover and transfer users' sequential preferences across domains.
However, misaligned item representations can lead to sub-optimal sequential modeling and user representation alignment.
We propose a model-agnostic framework called Cross-domain item representation Alignment for Cross-Domain Sequential Recommendation.
arXiv Detail & Related papers (2024-05-21T03:25:32Z) - Modality-Collaborative Transformer with Hybrid Feature Reconstruction
for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Towards Generalizable Referring Image Segmentation via Target Prompt and
Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach, which substantially improves the generalization ability by addressing two key dilemmas.
Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context.
arXiv Detail & Related papers (2023-12-01T09:31:24Z) - Connecting Multi-modal Contrastive Representations [50.26161419616139]
Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically shared space.
This paper proposes a novel training-efficient method for learning MCR without paired data, called Connecting Multi-modal Contrastive Representations (C-MCR).
C-MCR achieves audio-visual state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks.
arXiv Detail & Related papers (2023-05-22T09:44:39Z) - Towards Lightweight Cross-domain Sequential Recommendation via External
Attention-enhanced Graph Convolution Network [7.1102362215550725]
Cross-domain Sequential Recommendation (CSR) depicts the evolution of behavior patterns for overlapped users by modeling their interactions from multiple domains.
We introduce a lightweight external attention-enhanced GCN-based framework, namely LEA-GCN, to address these challenges.
To further simplify the framework structure and aggregate the user-specific sequential pattern, we devise a novel dual-channel External Attention (EA) component.
arXiv Detail & Related papers (2023-02-07T03:06:29Z) - Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for
Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z) - CoADNet: Collaborative Aggregation-and-Distribution Networks for
Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)