IMRAM: Iterative Matching with Recurrent Attention Memory for
Cross-Modal Image-Text Retrieval
- URL: http://arxiv.org/abs/2003.03772v1
- Date: Sun, 8 Mar 2020 12:24:41 GMT
- Title: IMRAM: Iterative Matching with Recurrent Attention Memory for
Cross-Modal Image-Text Retrieval
- Authors: Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han
- Abstract summary: Existing methods leverage the attention mechanism to explore image-text correspondence in a fine-grained manner.
However, it may be difficult for such existing methods to optimally capture these sophisticated correspondences.
We propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences are captured through multiple steps of alignment.
- Score: 105.77562776008459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling bi-directional retrieval of images and texts is important for
understanding the correspondence between vision and language. Existing methods
leverage the attention mechanism to explore such correspondence in a
fine-grained manner. However, most of them consider all semantics equally and
thus align them uniformly, regardless of their diverse complexities. In fact,
semantics are diverse (i.e. involving different kinds of semantic concepts),
and humans usually follow a latent structure to combine them into
understandable language. It may be difficult to optimally capture such
sophisticated correspondences in existing methods. In this paper, to address
such a deficiency, we propose an Iterative Matching with Recurrent Attention
Memory (IMRAM) method, in which correspondences between images and texts are
captured with multiple steps of alignments. Specifically, we introduce an
iterative matching scheme to explore such fine-grained correspondence
progressively. A memory distillation unit is used to refine alignment knowledge
from early steps to later ones. Experimental results on three benchmark datasets,
i.e. Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves
state-of-the-art performance, well demonstrating its effectiveness. Experiments
on a practical business advertisement dataset, named KWAI-AD, further validate
the applicability of our method in practical scenarios.
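As a minimal PyTorch-style sketch of this iterative matching idea (an illustration, not the paper's exact formulation): a cross-attention step aligns image regions with text words, a gated memory-distillation unit refines the query features for the next step, and the step-wise similarities accumulate into a matching score. Names such as `cross_attend`, `imram_score`, `smooth`, and `steps` are assumptions introduced here for illustration.
```python
import torch
import torch.nn.functional as F

def cross_attend(query, context, smooth=9.0):
    """One alignment step: attend each query vector over the context set."""
    sims = F.normalize(query, dim=-1) @ F.normalize(context, dim=-1).t()
    attn = torch.softmax(smooth * sims, dim=-1)   # (n_query, n_context)
    return attn @ context                         # context summary per query

class MemoryDistillation(torch.nn.Module):
    """Gated unit that refines the query with what the last step aligned."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, query, attended):
        g = torch.sigmoid(self.gate(torch.cat([query, attended], dim=-1)))
        return g * query + (1.0 - g) * attended   # distilled query for next step

def imram_score(regions, words, memory_unit, steps=3):
    """Accumulate an image-text similarity over several alignment steps."""
    query, score = regions, 0.0
    for _ in range(steps):
        attended = cross_attend(query, words)     # align regions with words
        score = score + F.cosine_similarity(query, attended, dim=-1).mean()
        query = memory_unit(query, attended)      # refine alignment knowledge
    return score
```
The same loop can be run symmetrically with words as queries and regions as context to obtain the text-to-image matching direction.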
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.
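As a rough sketch of how such an attention-based memory module could weight retrieved examples (the paper's learned module is not reproduced here; `attend_memory` and `temperature` are illustrative assumptions):
```python
import torch

def attend_memory(query, retrieved, temperature=0.07):
    """
    Weight k retrieved example embeddings by relevance to the query, so
    irrelevant neighbours contribute little to the refined representation.
    query: (d,), retrieved: (k, d); both assumed L2-normalized.
    """
    logits = retrieved @ query / temperature   # (k,) similarity to the query
    weights = torch.softmax(logits, dim=0)     # learned in the paper; fixed here
    return weights @ retrieved, weights        # attention-pooled memory summary
```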
arXiv Detail & Related papers (2023-04-11T12:12:05Z)
- CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method effectively adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
- Semantic-Aware Fine-Grained Correspondence [8.29030327276322]
We propose to learn semantic-aware fine-grained correspondence using image-level self-supervised methods.
We design a pixel-level self-supervised learning objective which specifically targets fine-grained correspondence.
Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks.
arXiv Detail & Related papers (2022-07-21T12:51:41Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text images.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state-of-the-art by 3.72% and 5.39% mAP on the two evaluated benchmarks, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art without any post-processing.
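A minimal sketch of text-to-pixel contrastive alignment in this spirit, assuming L2-normalized embeddings and a binary ground-truth mask; the loss form and the temperature `tau` are illustrative choices rather than CRIS's exact objective:
```python
import torch
import torch.nn.functional as F

def text_pixel_contrastive(text_emb, pixel_embs, mask, tau=0.07):
    """
    Text-to-pixel contrast: the sentence embedding should score high against
    pixels inside the referred mask and low against all others.
    text_emb: (d,), pixel_embs: (n_pixels, d), mask: (n_pixels,) of {0, 1}.
    """
    sims = F.normalize(pixel_embs, dim=-1) @ F.normalize(text_emb, dim=0)
    return F.binary_cross_entropy_with_logits(sims / tau, mask.float())
```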
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching [10.992151305603267]
We propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
We incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss.
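A hedged sketch of the adaptive-margin idea: the margin widens as the negative caption's CIDEr-based similarity to the ground truth drops, so clearly unrelated negatives must be separated more strongly. The scaling rule and the `cider_gap` input are assumptions for illustration, not the paper's exact SAM definition.
```python
import torch

def sam_triplet_loss(s_pos, s_neg, cider_gap, base_margin=0.2):
    """
    Triplet ranking with a semantic adaptive margin: the less the negative
    caption resembles the ground truth (larger CIDEr-based gap in [0, 1]),
    the bigger the margin it must be separated by.
    s_pos, s_neg: similarity score tensors; cider_gap: semantic distance.
    """
    margin = base_margin * (1.0 + cider_gap)   # assumed scaling rule
    return torch.clamp(margin + s_neg - s_pos, min=0.0).mean()
```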
arXiv Detail & Related papers (2021-10-06T09:54:28Z)
- Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization [21.904563910555368]
We propose a novel learning framework to construct a set of discriminative data domains within each image-text pair.
Our approach can generally improve the learning efficiency and performance of existing metric learning frameworks.
arXiv Detail & Related papers (2020-10-23T01:48:37Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss.
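The snippet below sketches the standard bidirectional hinge-based triplet ranking loss over in-batch negatives commonly used for this task; it is a conventional formulation, not necessarily this paper's exact variant.
```python
import torch

def triplet_ranking_loss(img_embs, txt_embs, margin=0.2):
    """
    Bidirectional hinge loss over in-batch negatives: matched image-text
    pairs (the diagonal) must outscore mismatched ones by a fixed margin.
    img_embs, txt_embs: (batch, d), assumed L2-normalized.
    """
    scores = img_embs @ txt_embs.t()                          # (batch, batch)
    pos = scores.diag().unsqueeze(1)                          # matched-pair scores
    cost_im = torch.clamp(margin + scores - pos, min=0)       # image-to-text
    cost_tx = torch.clamp(margin + scores - pos.t(), min=0)   # text-to-image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_im = cost_im.masked_fill(mask, 0)                    # drop positives
    cost_tx = cost_tx.masked_fill(mask, 0)
    return cost_im.mean() + cost_tx.mean()
```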
arXiv Detail & Related papers (2020-02-23T23:58:04Z)