Exploiting Multiple Sequence Lengths in Fast End to End Training for
Image Captioning
- URL: http://arxiv.org/abs/2208.06551v4
- Date: Fri, 19 Jan 2024 02:42:20 GMT
- Title: Exploiting Multiple Sequence Lengths in Fast End to End Training for
Image Captioning
- Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
- Abstract summary: We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence.
By doing so, the model can learn more effectively compared to traditional attention-based approaches.
- Score: 52.25026952905702
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce a method called the Expansion mechanism that processes the input
unconstrained by the number of elements in the sequence. By doing so, the model
can learn more effectively compared to traditional attention-based approaches.
To support this claim, we design a novel architecture, ExpansionNet v2, which
achieved strong results on the MS COCO 2014 Image Captioning challenge and set
the state of the art in its category, with scores of 143.7 CIDEr-D on the
offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9
All-CIDEr on the nocaps validation set. Additionally, we introduce an End to End
training algorithm up to 2.8 times faster than established alternatives. Source
code available at: https://github.com/jchenghu/ExpansionNet_v2
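The core idea of the abstract, processing a sequence at a length different from its number of input elements, can be illustrated with a toy expand/contract step. This is only a minimal sketch of the concept, not the ExpansionNet v2 architecture: the learned "length vectors" are random stand-ins, the slot count is arbitrary, and the processing at the expanded length is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expand_contract(x, num_slots=16):
    """Toy sketch of sequence-length expansion: project an (L, d)
    sequence onto num_slots vectors, process at that new length,
    then map back to length L. Only illustrates decoupling the
    processing length from the input length."""
    L, d = x.shape
    # hypothetical learned "length vectors" (random placeholders here)
    slots = rng.normal(size=(num_slots, d))
    # forward expansion: attention-style weights from slots to tokens
    w_fwd = softmax(slots @ x.T)        # (num_slots, L)
    expanded = w_fwd @ x                # (num_slots, d)
    # any per-element processing at the expanded length
    processed = np.tanh(expanded)
    # backward contraction: weights from tokens back to slots
    w_bwd = softmax(x @ slots.T)        # (L, num_slots)
    return w_bwd @ processed            # (L, d), original length restored

x = rng.normal(size=(7, 32))            # a 7-token sequence
y = expand_contract(x, num_slots=16)
print(y.shape)                          # (7, 32): output length matches input
```

Because the intermediate length (`num_slots`) is independent of the input length `L`, the model's internal processing is unconstrained by the number of elements in the sequence, which is the property the abstract highlights.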
Related papers
- Linear Alignment of Vision-language Models for Image Captioning [9.746397419479447]
We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP.
This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap.
We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT.
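The ReCap summary above claims a training protocol that fits a linear mapping between CLIP image and text embeddings without gradient computation. A closed-form least-squares fit is one way such a mapping can be obtained; the embeddings and dimensions below are random stand-ins, and the actual ReCap protocol may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for paired CLIP image/text embeddings (dims are illustrative)
img = rng.normal(size=(1000, 512))
txt = rng.normal(size=(1000, 512))

# closed-form least-squares fit of a linear map img -> txt:
# no gradient computation is needed, matching the summary's claim
W, *_ = np.linalg.lstsq(img, txt, rcond=None)

mapped = img @ W          # image embeddings projected into text space
print(W.shape)            # (512, 512)
```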
arXiv Detail & Related papers (2023-07-10T17:59:21Z)
- Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
arXiv Detail & Related papers (2023-06-20T03:00:22Z)
- Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network [26.97153244517095]
We propose a network that only needs a single pass through the visual-language model for each input image.
We first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder.
We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification.
arXiv Detail & Related papers (2023-04-03T17:59:21Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- Progressive Tree-Structured Prototype Network for End-to-End Image Captioning [74.8547752611337]
We propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN).
PTSN is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling the hierarchical textual semantics.
Our method achieves a new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the Karpathy split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server.
arXiv Detail & Related papers (2022-11-17T11:04:00Z)
- ExpansionNet: exploring the sequence length bottleneck in the Transformer for Image Captioning [0.0]
We propose a new method called the "Expansion Mechanism", which transforms the input sequence, either dynamically or statically, into a new one with a different sequence length.
We exploit this method and achieve competitive performance on the MS-COCO 2014 dataset.
arXiv Detail & Related papers (2022-07-07T14:37:02Z)
- End-to-End Supermask Pruning: Learning to Prune Image Captioning Models [17.00974730372399]
We show that an 80% to 95% sparse network can either match or outperform its dense counterpart.
We release code and pre-trained Up-Down and Object Relation Transformer models that achieve CIDEr scores above 120 on the MS-COCO dataset.
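The pruning summary above reports that networks at 80% to 95% sparsity can match their dense counterparts. The paper learns a supermask end to end; as a much simpler stand-in, one-shot magnitude pruning to a target sparsity can be sketched as follows (the matrix size and sparsity level are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_by_magnitude(w, sparsity=0.9):
    """Keep only the largest-magnitude (1 - sparsity) fraction of
    weights. A simplified stand-in for the paper's learned supermask."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = (np.abs(w) >= threshold).astype(w.dtype)
    return w * mask, mask

w = rng.normal(size=(256, 256))
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"fraction zeroed: {1 - mask.mean():.2f}")   # close to 0.90
```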
arXiv Detail & Related papers (2021-10-07T09:34:00Z)
- BiO-Net: Learning Recurrent Bi-directional Connections for Encoder-Decoder Architecture [82.64881585566825]
We present a novel Bi-directional O-shape network (BiO-Net) that reuses the building blocks in a recurrent manner without introducing any extra parameters.
Our method significantly outperforms the vanilla U-Net as well as other state-of-the-art methods.
arXiv Detail & Related papers (2020-07-01T05:07:49Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.