Exploiting Multiple Sequence Lengths in Fast End to End Training for
Image Captioning
- URL: http://arxiv.org/abs/2208.06551v4
- Date: Fri, 19 Jan 2024 02:42:20 GMT
- Title: Exploiting Multiple Sequence Lengths in Fast End to End Training for
Image Captioning
- Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
- Abstract summary: We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence.
By doing so, the model can learn more effectively compared to traditional attention-based approaches.
- Score: 52.25026952905702
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce a method called the Expansion mechanism that processes the input
unconstrained by the number of elements in the sequence. By doing so, the model
can learn more effectively compared to traditional attention-based approaches.
To support this claim, we design a novel architecture, ExpansionNet v2, which
achieved strong results on the MS COCO 2014 Image Captioning challenge and set
the state of the art in its category, with scores of 143.7 CIDEr-D on the
offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9
All-CIDEr on the nocaps validation set. Additionally, we introduce an End to End
training algorithm up to 2.8 times faster than established alternatives. Source
code available at: https://github.com/jchenghu/ExpansionNet_v2
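The core idea of the abstract, processing a sequence at a length different from its number of input elements, can be illustrated with a toy expand/contract step. This is only a minimal sketch of the concept, not the ExpansionNet v2 architecture: the learned "length vectors" are random stand-ins, the slot count is arbitrary, and the processing at the expanded length is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expand_contract(x, num_slots=16):
    """Toy sketch of sequence-length expansion: project an (L, d)
    sequence onto num_slots vectors, process at that new length,
    then map back to length L. Only illustrates decoupling the
    processing length from the input length."""
    L, d = x.shape
    # hypothetical learned "length vectors" (random placeholders here)
    slots = rng.normal(size=(num_slots, d))
    # forward expansion: attention-style weights from slots to tokens
    w_fwd = softmax(slots @ x.T)        # (num_slots, L)
    expanded = w_fwd @ x                # (num_slots, d)
    # any per-element processing at the expanded length
    processed = np.tanh(expanded)
    # backward contraction: weights from tokens back to slots
    w_bwd = softmax(x @ slots.T)        # (L, num_slots)
    return w_bwd @ processed            # (L, d), original length restored

x = rng.normal(size=(7, 32))            # a 7-token sequence
y = expand_contract(x, num_slots=16)
print(y.shape)                          # (7, 32): output length matches input
```

Because the intermediate length (`num_slots`) is independent of the input length `L`, the model's internal processing is unconstrained by the number of elements in the sequence, which is the property the abstract highlights.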
Related papers
- Linear Alignment of Vision-language Models for Image Captioning [9.746397419479447]
We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP.
This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap.
We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT.
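The ReCap summary above claims a training protocol that fits a linear mapping between CLIP image and text embeddings without gradient computation. A closed-form least-squares fit is one way such a mapping can be obtained; the embeddings and dimensions below are random stand-ins, and the actual ReCap protocol may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for paired CLIP image/text embeddings (dims are illustrative)
img = rng.normal(size=(1000, 512))
txt = rng.normal(size=(1000, 512))

# closed-form least-squares fit of a linear map img -> txt:
# no gradient computation is needed, matching the summary's claim
W, *_ = np.linalg.lstsq(img, txt, rcond=None)

mapped = img @ W          # image embeddings projected into text space
print(W.shape)            # (512, 512)
```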
arXiv Detail & Related papers (2023-07-10T17:59:21Z)
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
- Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [26.97153244517095]
We propose a network that only needs a single pass through the visual-language model for each input image.
We first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder.
We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification.
arXiv Detail & Related papers (2023-04-03T17:59:21Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- Progressive Tree-Structured Prototype Network for End-to-End Image Captioning [74.8547752611337]
We propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN).
PTSN is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling the hierarchical textual semantics.
Our method achieves a new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the Karpathy split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server.
arXiv Detail & Related papers (2022-11-17T11:04:00Z)
- ExpansionNet: exploring the sequence length bottleneck in the Transformer for Image Captioning [0.0]
We propose a new method called the "Expansion Mechanism", which transforms the input sequence, either dynamically or statically, into a new one with a different sequence length.
We exploit this method and achieve competitive performance on the MS-COCO 2014 dataset.
arXiv Detail & Related papers (2022-07-07T14:37:02Z)
- End-to-End Supermask Pruning: Learning to Prune Image Captioning Models [17.00974730372399]
We show that an 80% to 95% sparse network can either match or outperform its dense counterpart.
We release code and pre-trained Up-Down and Object Relation Transformer models that achieve CIDEr scores above 120 on the MS-COCO dataset.
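The pruning summary above reports that networks at 80% to 95% sparsity can match their dense counterparts. The paper learns a supermask end to end; as a much simpler stand-in, one-shot magnitude pruning to a target sparsity can be sketched as follows (the matrix size and sparsity level are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_by_magnitude(w, sparsity=0.9):
    """Keep only the largest-magnitude (1 - sparsity) fraction of
    weights. A simplified stand-in for the paper's learned supermask."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = (np.abs(w) >= threshold).astype(w.dtype)
    return w * mask, mask

w = rng.normal(size=(256, 256))
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"fraction zeroed: {1 - mask.mean():.2f}")   # close to 0.90
```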
arXiv Detail & Related papers (2021-10-07T09:34:00Z)
- BiO-Net: Learning Recurrent Bi-directional Connections for Encoder-Decoder Architecture [82.64881585566825]
We present a novel Bi-directional O-shape network (BiO-Net) that reuses the building blocks in a recurrent manner without introducing any extra parameters.
Our method significantly outperforms the vanilla U-Net as well as other state-of-the-art methods.
arXiv Detail & Related papers (2020-07-01T05:07:49Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.