Semi-Autoregressive Transformer for Image Captioning
- URL: http://arxiv.org/abs/2106.09436v1
- Date: Thu, 17 Jun 2021 12:36:33 GMT
- Title: Semi-Autoregressive Transformer for Image Captioning
- Authors: Yuanen Zhou, Yong Zhang, Zhenzhen Hu, Meng Wang
- Abstract summary: We introduce a semi-autoregressive model for image captioning (dubbed SATIC).
It keeps the autoregressive property globally but generates words in parallel locally.
Experiments on the MSCOCO image captioning benchmark show that SATIC achieves a better trade-off between speed and quality without bells and whistles.
- Score: 17.533503295862808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art image captioning models adopt autoregressive
decoders, i.e., they generate each word by conditioning on previously generated
words, which leads to heavy latency during inference. To tackle this issue,
non-autoregressive image captioning models have recently been proposed to
significantly accelerate inference by generating all words in
parallel. However, these non-autoregressive models inevitably suffer from a large
degradation in generation quality because they remove too much of the dependence between words.
To make a better trade-off between speed and quality, we introduce a
semi-autoregressive model for image captioning (dubbed SATIC), which keeps
the autoregressive property globally but generates words in parallel locally.
Because it is based on the Transformer, only a few modifications are needed to implement
SATIC. Extensive experiments on the MSCOCO image captioning benchmark show that
SATIC can achieve a better trade-off without bells and whistles. The code is
available at: https://github.com/YuanEZhou/satic
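The "autoregressive globally, parallel locally" idea in the abstract can be illustrated with a group-wise attention mask. The following is a minimal sketch and not the authors' implementation. The function names and the `group_size` parameter are hypothetical, and the mask rule (a position may attend to every position in its own group and in earlier groups) is an assumption based on how semi-autoregressive Transformers are commonly described.

```python
def semi_autoregressive_mask(length, group_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: positions are split into consecutive groups of `group_size`.
    A position sees its own group and all earlier groups, so groups are
    produced one after another while words inside a group are produced
    in parallel.
    """
    return [
        [j // group_size <= i // group_size for j in range(length)]
        for i in range(length)
    ]


def decoding_steps(length, group_size):
    """Number of sequential decoding steps (ceiling division)."""
    return -(-length // group_size)


if __name__ == "__main__":
    # Print the mask for a 6-word caption decoded in groups of 2.
    for row in semi_autoregressive_mask(6, 2):
        print("".join("1" if allowed else "0" for allowed in row))
    print("sequential steps:", decoding_steps(6, 2))
```

Under these assumptions, `group_size = 1` gives the usual causal mask of an autoregressive decoder, and `group_size = length` gives a fully parallel (non-autoregressive) mask, so the group size acts as the knob for the speed/quality trade-off.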
Related papers
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
(2024-10-02T16:05:27Z)
- Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates a 256×256 image in about one second on one V100 GPU.
(2023-12-22T10:01:54Z)
- Bounding and Filling: A Fast and Flexible Framework for Image Captioning [5.810020749348207]
We introduce a fast and flexible framework for image captioning called BoFiCap based on bounding and filling techniques.
In a non-autoregressive manner, our framework achieves state-of-the-art results on the task-specific metric CIDEr while achieving a 9.22x speedup.
(2023-10-15T16:17:20Z)
- Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
(2022-12-06T16:08:16Z)
- Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically generate captions autoregressively.
Non-autoregressive image captioning with continuous iterative refinement can achieve performance comparable to its autoregressive counterparts with considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed.
(2021-10-11T15:11:54Z)
- Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
(2021-01-24T12:16:45Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length-level embedding to endow existing captioning models with the ability to control caption length.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions with length-independent complexity.
(2020-07-19T03:40:51Z)
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
(2020-05-10T15:09:44Z)