CaMEL: Mean Teacher Learning for Image Captioning
- URL: http://arxiv.org/abs/2202.10492v1
- Date: Mon, 21 Feb 2022 19:04:46 GMT
- Title: CaMEL: Mean Teacher Learning for Image Captioning
- Authors: Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia
Cascianelli, Lorenzo Baraldi, Rita Cucchiara
- Abstract summary: We present CaMEL, a novel Transformer-based architecture for image captioning.
Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase.
Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors.
- Score: 47.9708610052655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Describing images in natural language is a fundamental step towards the
automatic modeling of connections between the visual and textual modalities. In
this paper we present CaMEL, a novel Transformer-based architecture for image
captioning. Our proposed approach leverages the interaction of two
interconnected language models that learn from each other during the training
phase. The interplay between the two language models follows a mean teacher
learning paradigm with knowledge distillation. Experimentally, we assess the
effectiveness of the proposed solution on the COCO dataset and in conjunction
with different visual feature extractors. When comparing with existing
proposals, we demonstrate that our model provides state-of-the-art caption
quality with a significantly reduced number of parameters. According to the
CIDEr metric, we obtain a new state of the art on COCO when training without
using external data. The source code and trained models are publicly available
at: https://github.com/aimagelab/camel.
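The mean teacher paradigm described above pairs an online (student) language model with a teacher whose weights track an exponential moving average of the student's, while knowledge distillation feeds the teacher's predictions back into the student's loss. The sketch below illustrates that training pattern under stated assumptions: the tiny model, loss weighting, and hyper-parameters are hypothetical placeholders, not the released CaMEL code.

```python
# Minimal mean-teacher + knowledge-distillation sketch (illustrative, not the CaMEL release).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCaptioner(nn.Module):
    """Stand-in for the Transformer language model (hypothetical placeholder)."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher ("mean teacher") weights follow an exponential moving average of the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)

def distillation_step(student, teacher, tokens, targets, optimizer, alpha=0.5, tau=2.0):
    # Student learns from ground-truth captions and from the teacher's soft predictions.
    student_logits = student(tokens)
    with torch.no_grad():
        teacher_logits = teacher(tokens)
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), targets.view(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    loss = alpha * ce + (1.0 - alpha) * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # the two models stay interconnected throughout training
    return loss.item()

# Usage: two interconnected models, the teacher initialized as a copy of the student.
student = TinyCaptioner()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
tokens = torch.randint(0, 100, (4, 16))   # dummy token ids
targets = torch.randint(0, 100, (4, 16))  # dummy target captions
print(distillation_step(student, teacher, tokens, targets, optimizer))
```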
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding [47.97650346560239]
We propose a Spatial-Temporal Auxiliary Network with a Mutual-guided alignment module (Mug-STAN) to extend image-text models to diverse video tasks and video-text data.
Mug-STAN significantly improves the adaptation of language-image pretrained models such as CLIP and CoCa at both the video-text post-pretraining and finetuning stages.
arXiv Detail & Related papers (2023-11-25T17:01:38Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality (see the retrieval sketch after this list).
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to align the image and text representations before fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
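The retrieval-augmented entries above share the same mechanism: look up an external memory of visual features with the current image and pass the associated captions (or their tokens) to the decoder. Below is a minimal sketch of that kNN lookup, assuming a cosine-similarity memory; the memory layout, variable names, and downstream conditioning are illustrative assumptions rather than either paper's implementation.

```python
# Illustrative kNN retrieval over visual features (not the papers' actual code).
import torch
import torch.nn.functional as F

def knn_retrieve(query, memory_keys, memory_captions, k=3):
    """Return the k captions whose visual keys are most similar to the query feature."""
    # Cosine similarity between the query image feature and all memorised visual keys.
    sims = F.cosine_similarity(query.unsqueeze(0), memory_keys, dim=-1)
    topk = sims.topk(k)
    return [memory_captions[i] for i in topk.indices.tolist()], topk.values

# Usage with dummy data: an external memory of visual keys paired with captions.
memory_keys = torch.randn(1000, 512)                     # visual features of stored images
memory_captions = [f"caption {i}" for i in range(1000)]  # captions paired with those images
query = torch.randn(512)                                 # feature of the image to describe
captions, scores = knn_retrieve(query, memory_keys, memory_captions, k=3)
print(list(zip(captions, scores.tolist())))
# The retrieved captions (or their token embeddings) could then condition the decoder,
# e.g. through an additional attention layer over the retrieved tokens.
```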
This list is automatically generated from the titles and abstracts of the papers on this site.