Semi-Autoregressive Image Captioning
- URL: http://arxiv.org/abs/2110.05342v2
- Date: Wed, 13 Oct 2021 07:35:53 GMT
- Title: Semi-Autoregressive Image Captioning
- Authors: Xu Yan, Zhengcong Fei, Zekang Li, Shuhui Wang, Qingming Huang, Qi Tian
- Abstract summary: Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve performance comparable to its autoregressive counterparts with considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to achieve a better trade-off between performance and speed.
- Score: 153.9658053662605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art approaches for image captioning typically adopt an
autoregressive manner, i.e., generating descriptions word by word, which
suffers from slow decoding and becomes a bottleneck in real-time
applications. Non-autoregressive image captioning with continuous iterative
refinement, which eliminates the sequential dependence within sentence
generation, can achieve performance comparable to its autoregressive
counterparts with considerable acceleration. Nevertheless, through a
carefully designed experiment, we empirically show that the number of
refinement iterations can be effectively reduced when sufficient prior
knowledge is provided to the language decoder. To that end, we propose a
novel two-stage framework, referred to as Semi-Autoregressive Image
Captioning (SAIC), to achieve a better trade-off between performance and
speed. The SAIC model maintains the autoregressive property globally but
relaxes it locally. Specifically, SAIC first generates an intermittent
sequence autoregressively, i.e., it predicts the first word of every word
group in order. Then, conditioned on this partially determined prior
information and the image features, SAIC fills in all the skipped words
non-autoregressively in a single iteration.
Experimental results on the MS COCO benchmark demonstrate that our SAIC model
outperforms the preceding non-autoregressive image captioning models while
obtaining a competitive inference speedup. Code is available at
https://github.com/feizc/SAIC.
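To make the two-stage decoding described above concrete, here is a minimal Python sketch of the idea: an autoregressive pass predicts the first word of each word group, and a single non-autoregressive pass then fills the skipped positions in parallel. All names (`ar_step`, `nar_fill`, `image_features`, `group_size`) are hypothetical placeholders, not the interface of the released code at the repository above.

```python
# A minimal, hypothetical sketch of SAIC-style two-stage decoding.
# All names (ar_step, nar_fill, image_features, group_size) are placeholders
# chosen for illustration; this is not the authors' released implementation.

from typing import Callable, List

MASK = "<mask>"
EOS = "<eos>"


def saic_decode(
    ar_step: Callable[[List[str], object], str],
    nar_fill: Callable[[List[str], object], List[str]],
    image_features: object,
    group_size: int = 3,
    max_groups: int = 10,
) -> List[str]:
    # ---- Stage 1: autoregressive pass over group-leading words only ----
    # ar_step(leaders_so_far, image_features) is assumed to return the next
    # group-leading word, conditioned on the leaders predicted so far.
    leaders: List[str] = []
    for _ in range(max_groups):
        word = ar_step(leaders, image_features)
        leaders.append(word)
        if word == EOS:
            break

    # ---- Stage 2: one non-autoregressive pass fills the skipped slots ----
    # Build a skeleton where each leader heads a group of `group_size` slots,
    # e.g. [w1, <mask>, <mask>, w2, <mask>, <mask>, ...].
    skeleton: List[str] = []
    for word in leaders:
        skeleton.append(word)
        skeleton.extend([MASK] * (group_size - 1))

    # nar_fill(skeleton, image_features) is assumed to return one word per
    # <mask> position, all predicted in parallel in a single iteration.
    filled = iter(nar_fill(skeleton, image_features))
    return [next(filled) if tok == MASK else tok for tok in skeleton]
```

In this sketch the sequential loop runs once per word group rather than once per word, which is where the speedup over fully autoregressive decoding comes from, while the group-leading words serve as prior knowledge for the single fill-in pass.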
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
arXiv Detail & Related papers (2024-10-02T16:05:27Z)
- NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition [80.22784377150465]
Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding.
This paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER.
NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD).
arXiv Detail & Related papers (2024-07-16T04:52:39Z)
- Bounding and Filling: A Fast and Flexible Framework for Image Captioning [5.810020749348207]
We introduce a fast and flexible framework for image captioning called BoFiCap based on bounding and filling techniques.
In a non-autoregressive manner, our framework achieves state-of-the-art results on the task-specific metric CIDEr while providing a 9.22x speedup.
arXiv Detail & Related papers (2023-10-15T16:17:20Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- Efficient Modeling of Future Context for Image Captioning [38.52032153180971]
Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations with a modified mask operation.
Our proposed approach clearly surpasses the state-of-the-art baselines in both automatic metrics and human evaluations.
arXiv Detail & Related papers (2022-07-22T06:21:43Z)
- Semi-Autoregressive Transformer for Image Captioning [17.533503295862808]
We introduce a semi-autoregressive model for image captioning (dubbed SATIC).
It keeps the autoregressive property globally but generates words in parallel locally.
Experiments on the MSCOCO image captioning benchmark show that SATIC can achieve a better trade-off without bells and whistles.
arXiv Detail & Related papers (2021-06-17T12:36:33Z)
- Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up the inference time by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2021-01-24T12:16:45Z)
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)