Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
- URL: http://arxiv.org/abs/2304.05173v1
- Date: Tue, 11 Apr 2023 12:12:05 GMT
- Title: Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
- Authors: Ahmet Iscen, Alireza Fathi, Cordelia Schmid
- Abstract summary: We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of irrelevant retrieved examples and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracies on the ImageNet-LT, Places-LT, and WebVision datasets.
- Score: 68.63453336523318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented models are becoming increasingly popular for computer
vision tasks after their recent success in NLP problems. The goal is to enhance
the recognition capabilities of the model by retrieving similar examples for
the visual input from an external memory set. In this work, we introduce an
attention-based memory module, which learns the importance of each retrieved
example from the memory. Compared to existing approaches, our method removes
the influence of irrelevant retrieved examples and retains those that are
beneficial to the input query. We also thoroughly study various ways of
constructing the memory dataset. Our experiments show the benefit of using a
massive-scale memory dataset of 1B image-text pairs, and demonstrate the
performance of different memory representations. We evaluate our method in
three different classification tasks, namely long-tailed recognition, learning
with noisy labels, and fine-grained classification, and show that it achieves
state-of-the-art accuracies on the ImageNet-LT, Places-LT, and WebVision datasets.
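The attention-based memory module described above can be sketched in a few lines: score each retrieved example against the query, convert the scores to attention weights, and aggregate, so that irrelevant retrievals contribute little. This is a minimal NumPy illustration of the idea, not the authors' implementation; the function names, dimensions, and temperature value are all assumed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_to_memory(query, mem_keys, mem_values, temperature=0.07):
    """Weight retrieved memory entries by attention so that irrelevant
    examples are suppressed (a sketch of the idea, not the paper's code).

    query:      (d,)   embedding of the input image
    mem_keys:   (k, d) embeddings of the k retrieved examples
    mem_values: (k, c) per-example label/feature vectors to aggregate
    """
    # Similarity between the query and each retrieved example.
    scores = mem_keys @ query / temperature
    # Attention weights: relevant examples get high weight,
    # irrelevant ones get weight near zero.
    weights = softmax(scores)
    # Aggregate the retrieved values, weighted by relevance.
    return weights @ mem_values

rng = np.random.default_rng(0)
q = rng.normal(size=8)
keys = rng.normal(size=(5, 8))
vals = rng.normal(size=(5, 3))
out = attend_to_memory(q, keys, vals)
```

Because the weights sum to one, a single highly similar memory entry can dominate the aggregation while dissimilar entries are effectively ignored, which is the behavior the abstract attributes to the learned module.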
Related papers
- Learning from Memory: Non-Parametric Memory Augmented Self-Supervised Learning of Visual Features [6.096888891865663]
The proposed method involves augmenting a neural network with a memory component to compare current image views with previously encountered concepts.
We benchmark our method on several vision tasks, such as linear probing, transfer learning, low-shot classification, and image retrieval, across multiple datasets.
The experimental results consolidate the effectiveness of the proposed approach in achieving stable SSL training without additional regularizers.
arXiv Detail & Related papers (2024-07-03T06:46:08Z) - Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Transformer based Multitask Learning for Image Captioning and Object
Detection [13.340784876489927]
This work introduces a novel multitask learning framework that combines image captioning and object detection into a joint model.
We propose TICOD, Transformer-based Image Captioning and Object detection model for jointly training both tasks.
Our model outperforms the baselines from image captioning literature by achieving a 3.65% improvement in BERTScore.
arXiv Detail & Related papers (2024-03-10T19:31:13Z) - Generative Cross-Modal Retrieval: Memorizing Images in Multimodal
Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
arXiv Detail & Related papers (2024-02-16T16:31:46Z) - Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
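The retrieval step behind such kNN-memory captioning models can be illustrated with a simple nearest-neighbor lookup by visual similarity. The helper below is a hypothetical sketch of the retrieval component only (not the paper's architecture); feature shapes and the cosine-similarity choice are assumptions.

```python
import numpy as np

def knn_retrieve(query_feat, corpus_feats, corpus_texts, k=3):
    """Retrieve the k corpus captions whose image features are most
    similar to the query image feature, by cosine similarity.
    A hypothetical helper illustrating the retrieval step only.

    query_feat:   (d,)   feature of the query image
    corpus_feats: (n, d) features of the external corpus images
    corpus_texts: list of n caption strings aligned with corpus_feats
    """
    q = query_feat / np.linalg.norm(query_feat)
    c = corpus_feats / np.linalg.norm(corpus_feats, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity to each corpus item
    top = np.argsort(-sims)[:k]        # indices of the k most similar items
    return [corpus_texts[i] for i in top], sims[top]

feats = np.eye(4)
texts = ["a cat", "a dog", "a bird", "a fish"]
captions, scores = knn_retrieve(np.array([0.0, 0.0, 1.0, 0.0]), feats, texts, k=2)
```

In the models above, the retrieved captions are then fed into a kNN-augmented attention layer during generation rather than returned directly.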
arXiv Detail & Related papers (2022-07-26T19:35:49Z) - Saliency Guided Experience Packing for Replay in Continual Learning [6.417011237981518]
We propose a new approach for experience replay, where we select the past experiences by looking at the saliency maps.
While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model about its past decisions.
arXiv Detail & Related papers (2021-09-10T15:54:58Z) - Memory Wrap: a Data-Efficient and Interpretable Extension to Image
Classification Models [9.848884631714451]
Memory Wrap is a plug-and-play extension to any image classification model.
It improves both data-efficiency and model interpretability, adopting a content-attention mechanism.
We show that Memory Wrap outperforms standard classifiers when it learns from a limited set of data.
arXiv Detail & Related papers (2021-06-01T07:24:19Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
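The dual-encoder contrastive objective described above pulls matched image/text pairs together and pushes mismatched pairs apart. A simplified NumPy version of a symmetric InfoNCE-style loss is sketched below; it is an illustration under assumed batch size and temperature, not the paper's implementation.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    image/text embedding pairs; matched pairs sit on the diagonal
    of the similarity matrix. A sketch, not the paper's code."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (n, n) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        # Row-wise cross-entropy against the diagonal (matched pair).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(1)
e = rng.normal(size=(4, 6))
aligned = contrastive_loss(e, e)            # perfectly matched pairs
shuffled = contrastive_loss(e, e[::-1].copy())  # mismatched pairs
```

The loss is lowest when each image embedding is closest to its own text embedding, which is what lets a noisy billion-pair corpus still yield well-aligned representations.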
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.