With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning
- URL: http://arxiv.org/abs/2308.12383v1
- Date: Wed, 23 Aug 2023 18:53:00 GMT
- Title: With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning
- Authors: Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita
Cucchiara
- Abstract summary: We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy only and when fine-tuning with self-critical sequence training.
- Score: 47.96387857237473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning, like many tasks involving vision and language, currently
relies on Transformer-based architectures for extracting the semantics in an
image and translating it into linguistically coherent descriptions. Although
successful, the attention operator considers only a weighted summation of
projections of the current input sample, thereby ignoring the relevant
semantic information that can come from the joint observation of other
samples. In this paper, we devise a network which can perform attention over
activations obtained while processing other training samples, through a
prototypical memory model. Our memory models the distribution of past keys and
values through the definition of prototype vectors which are both
discriminative and compact. Experimentally, we assess the performance of the
proposed model on the COCO dataset, in comparison with carefully designed
baselines and state-of-the-art approaches, and by investigating the role of
each of the proposed components. We demonstrate that our proposal can increase
the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when
training with cross-entropy only and when fine-tuning with self-critical
sequence training. Source code and trained models are available at:
https://github.com/aimagelab/PMA-Net.
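A minimal PyTorch sketch of the core idea, attention over prototype vectors that summarize keys and values from past training samples, is given below. The prototype count, their initialization as plain learnable parameters, and the single-head formulation are simplifying assumptions made here for illustration; the paper instead fits the prototypes to the distribution of past keys and values, so refer to the repository above for the actual implementation.

```python
import torch
import torch.nn as nn

class PrototypeAugmentedAttention(nn.Module):
    """Single-head attention whose keys/values are extended with a bank of
    prototype vectors standing in for activations from past training samples.
    Hypothetical sketch: the real PMA-Net derives prototypes from the
    distribution of past keys and values (see the official repository)."""

    def __init__(self, d_model: int, n_prototypes: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Prototype keys/values: plain learnable parameters in this sketch;
        # in the paper they model the distribution of past keys and values.
        self.proto_k = nn.Parameter(torch.randn(n_prototypes, d_model))
        self.proto_v = nn.Parameter(torch.randn(n_prototypes, d_model))
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b = x.size(0)
        q = self.q_proj(x)
        # Concatenate the current sample's keys/values with the prototypes,
        # so tokens can also attend to "memories" of other training samples.
        k = torch.cat([self.k_proj(x), self.proto_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(x), self.proto_v.expand(b, -1, -1)], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```

For instance, `PrototypeAugmentedAttention(512)(torch.randn(2, 10, 512))` returns a `(2, 10, 512)` tensor in which every token has also attended to the 64 prototype slots.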
Related papers
- Correlation Weighted Prototype-based Self-Supervised One-Shot Segmentation of Medical Images [12.365801596593936]
Medical image segmentation is one of the domains where sufficient annotated data is often unavailable.
We propose a prototype-based self-supervised one-way one-shot learning framework using pseudo-labels generated from superpixels.
We show that the proposed simple but potent framework performs on par with the state-of-the-art methods (a minimal prototype-matching sketch follows this entry).
arXiv Detail & Related papers (2024-08-12T15:38:51Z)
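For context, the generic prototype-matching step that this family of methods builds on can be sketched as follows; the superpixel pseudo-labeling, correlation weighting, and one-way episode construction of the paper above are omitted, and the temperature and thresholding are assumptions:

```python
import torch
import torch.nn.functional as F

def prototype_segmentation(support_feat, support_mask, query_feat, tau=20.0):
    """Generic prototype matching for one-shot segmentation (simplified sketch).
    support_feat: (C, H, W) features of the support image
    support_mask: (H, W) binary mask of the target class
    query_feat:   (C, H, W) features of the query image
    Returns an (H, W) foreground probability map for the query."""
    # Masked average pooling: one prototype vector for the target class.
    masked = support_feat * support_mask.unsqueeze(0)
    prototype = masked.sum(dim=(1, 2)) / support_mask.sum().clamp(min=1)  # (C,)
    # Cosine similarity between every query location and the prototype.
    sim = F.cosine_similarity(query_feat, prototype.view(-1, 1, 1), dim=0)
    return torch.sigmoid(tau * sim)  # scale, then squash to probabilities
```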
- Pre-Trained Vision-Language Models as Partial Annotators [40.89255396643592]
Pre-trained vision-language models are trained on massive data to learn unified representations of images and natural language.
In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for applying pre-trained models, and experiment on image classification tasks.
arXiv Detail & Related papers (2024-05-23T17:17:27Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches; a minimal cross-attention sketch follows this entry.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
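The mutual-attention idea can be illustrated with a plain bidirectional cross-attention pass between support and query patch tokens; the paper's intra-task formulation differs in its details, so treat this as an assumed, minimal rendering:

```python
import torch
import torch.nn.functional as F

def mutual_attention(support_tokens, query_tokens):
    """Bidirectional cross-attention between support and query patch tokens
    (simplified sketch of the mutual-attention idea).
    support_tokens: (Ns, D), query_tokens: (Nq, D)."""
    scale = support_tokens.size(-1) ** -0.5
    # Query patches attend to support patches...
    q2s = F.softmax(query_tokens @ support_tokens.T * scale, dim=-1)
    query_enriched = q2s @ support_tokens        # (Nq, D)
    # ...and support patches attend to query patches.
    s2q = F.softmax(support_tokens @ query_tokens.T * scale, dim=-1)
    support_enriched = s2q @ query_tokens        # (Ns, D)
    return support_enriched, query_enriched
```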
- Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z)
- ST-KeyS: Self-Supervised Transformer for Keyword Spotting in Historical Handwritten Documents [3.9688530261646653]
Keyword spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections.
We propose ST-KeyS, a masked auto-encoder model based on vision transformers where the pretraining stage is based on the mask-and-predict paradigm.
In the fine-tuning stage, the pre-trained encoder is integrated into a siamese neural network model that is fine-tuned to improve the feature embeddings of the input images (a simplified sketch follows this entry).
arXiv Detail & Related papers (2023-03-06T13:39:41Z)
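A minimal rendering of the two stages described above might look like the following; the masking ratio, the cosine matching criterion, and the `encoder` callable are all assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def mask_patches(patch_tokens, mask_ratio=0.75):
    """MAE-style mask-and-predict input preparation (sketch): keep a random
    subset of patch tokens. patch_tokens: (N, D) -> (kept_tokens, kept_idx)."""
    n = patch_tokens.size(0)
    keep = max(1, int(n * (1 - mask_ratio)))
    idx = torch.randperm(n)[:keep]
    return patch_tokens[idx], idx

def keyword_match(encoder, word_img_a, word_img_b, threshold=0.5):
    """Siamese matching for keyword spotting: two word images are deemed the
    same keyword if their embeddings are close (assumed cosine criterion)."""
    emb_a = encoder(word_img_a)
    emb_b = encoder(word_img_b)
    return F.cosine_similarity(emb_a, emb_b, dim=-1) > threshold
```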
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query (a minimal sketch follows this entry).
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
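The relative-location pretext task can be sketched as a classification problem over patch positions; the mean-pooled context, the linear head, and the visibility ratio below are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Predicts which of n_positions grid cells a query patch came from, given
    reference patch features (simplified sketch of the pretext task)."""

    def __init__(self, d_model: int, n_positions: int):
        super().__init__()
        self.classifier = nn.Linear(2 * d_model, n_positions)

    def forward(self, reference_feats, query_feat, visible_ratio=0.5):
        # reference_feats: (N, D) patch features of the reference view
        # query_feat: (D,) feature of the query patch to be localized
        n = reference_feats.size(0)
        # Difficulty control: only a random subset of reference patches is
        # visible to the query (the paper masks reference features similarly).
        keep = torch.randperm(n)[: max(1, int(n * visible_ratio))]
        context = reference_feats[keep].mean(dim=0)                 # (D,)
        return self.classifier(torch.cat([context, query_feat]))   # (n_positions,)
```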
- Rethinking Semantic Segmentation: A Prototype View [126.59244185849838]
We present a nonparametric semantic segmentation model based on non-learnable prototypes (a minimal nearest-prototype sketch follows this entry).
Our framework yields compelling results on several datasets.
We expect this work will provoke a rethink of the current de facto semantic segmentation model design.
arXiv Detail & Related papers (2022-03-28T21:15:32Z)
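Nearest-prototype pixel classification, the core non-parametric step, can be sketched as below; the number of prototypes per class and the cosine metric are assumptions broadly consistent with, but not identical to, the paper:

```python
import torch
import torch.nn.functional as F

def nearest_prototype_segment(pixel_feats, prototypes):
    """Classify each pixel by its nearest class prototype (sketch).
    pixel_feats: (H*W, D) pixel embeddings
    prototypes:  (n_classes, K, D) non-learnable prototypes, e.g. computed as
                 (momentum-updated) means of training pixel embeddings.
    Returns (H*W,) predicted class indices."""
    feats = F.normalize(pixel_feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    # Cosine similarity to every prototype of every class: (H*W, n_classes, K)
    sim = torch.einsum("pd,ckd->pck", feats, protos)
    # A pixel's score for a class is its best-matching prototype of that class.
    class_scores, _ = sim.max(dim=-1)      # (H*W, n_classes)
    return class_scores.argmax(dim=-1)     # (H*W,)
```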
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes (a minimal message-passing sketch follows this entry).
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
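The group-wise message-passing idea can be illustrated with one graph-convolution step over image-level node features; the softmax affinity and concatenation-based update rule below are assumed for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupMiningLayer(nn.Module):
    """One message-passing step over a group of images represented as graph
    nodes (simplified sketch of group-wise semantic mining)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.update = nn.Linear(2 * d_model, d_model)

    def forward(self, node_feats):
        # node_feats: (n_images, D), one feature vector per image in the group
        # Soft adjacency from pairwise feature affinity (assumed definition).
        affinity = F.softmax(node_feats @ node_feats.T, dim=-1)
        messages = affinity @ node_feats                  # aggregate neighbors
        out = self.update(torch.cat([node_feats, messages], dim=-1))
        return F.relu(out)  # refined, group-aware image representations
```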
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.