Bounding and Filling: A Fast and Flexible Framework for Image Captioning
- URL: http://arxiv.org/abs/2310.09876v1
- Date: Sun, 15 Oct 2023 16:17:20 GMT
- Title: Bounding and Filling: A Fast and Flexible Framework for Image Captioning
- Authors: Zheng Ma, Changxin Wang, Bo Huang, Zixuan Zhu and Jianbing Zhang
- Abstract summary: We introduce a fast and flexible framework for image captioning called BoFiCap based on bounding and filling techniques.
Our framework in a non-autoregressive manner achieves the state of the art on the task-specific metric CIDEr with a 9.22x speedup.
- Score: 5.810020749348207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most image captioning models following an autoregressive manner suffer from
significant inference latency. Several models adopted a non-autoregressive
manner to speed up the process. However, the vanilla non-autoregressive manner
results in subpar performance, since it generates all words simultaneously,
which fails to capture the relationships between words in a description. The
semi-autoregressive manner employs a partially parallel method to preserve
performance, but it sacrifices inference speed. In this paper, we introduce a
fast and flexible framework for image captioning called BoFiCap based on
bounding and filling techniques. The BoFiCap model leverages the inherent
characteristics of image captioning tasks to pre-define bounding boxes for
image regions and their relationships. Subsequently, the BoFiCap model fills
in the corresponding words for each box using two generation manners. Leveraging the
box hints, our filling process allows each word to better perceive other words.
Additionally, our model offers flexible image description generation: 1) it can
employ different generation manners depending on speed or performance
requirements, and 2) it can produce varied sentences from user-specified boxes.
Experimental evaluations on the MS-COCO benchmark dataset demonstrate that our
framework in a non-autoregressive manner achieves state-of-the-art performance
on the task-specific metric CIDEr (125.6) with a 9.22x speedup over the
autoregressive baseline model; in a semi-autoregressive manner, our method
reaches 128.4 CIDEr with a 3.69x speedup. Our code and data are available at
https://github.com/ChangxinWang/BoFiCap.
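As a rough illustration of the latency trade-off the abstract describes, the toy sketch below counts sequential decoding rounds for the three generation manners. The decoder, vocabulary, and function names here are invented purely for illustration; they are not taken from the BoFiCap code or any real captioning model:

```python
VOCAB = ["a", "dog", "runs", "on", "the", "grass"]

def toy_predict(context, position):
    # Stand-in for a captioning decoder: deterministically maps
    # (context length, position) to a token. Purely illustrative.
    return VOCAB[(len(context) + position) % len(VOCAB)]

def autoregressive_decode(length):
    # One token per step; each step conditions on all previous tokens,
    # so `length` sequential model calls are needed (high latency).
    tokens = []
    for i in range(length):
        tokens.append(toy_predict(tokens, i))
    return tokens, length  # sequential rounds == caption length

def non_autoregressive_decode(length):
    # All tokens emitted in a single parallel round; no token sees its
    # neighbours, which is why vanilla NAR captioning loses quality.
    return [toy_predict([], i) for i in range(length)], 1

def semi_autoregressive_decode(length, group_size):
    # Groups are generated left to right (autoregressive between groups)
    # while tokens inside a group come out in parallel -- the middle
    # ground that box-level filling also exploits.
    tokens, rounds = [], 0
    while len(tokens) < length:
        n = min(group_size, length - len(tokens))
        group = [toy_predict(tokens, len(tokens) + j) for j in range(n)]
        tokens.extend(group)
        rounds += 1
    return tokens, rounds
```

For a 6-token caption, the autoregressive loop needs 6 sequential rounds, the non-autoregressive one needs 1, and the semi-autoregressive variant with group size 2 needs 3, which mirrors why the paper's speedups differ between the two manners.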
Related papers
- Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates a 256x256 image in about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z) - Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z) - Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
arXiv Detail & Related papers (2022-07-22T14:19:31Z) - Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC) to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z) - Semi-Autoregressive Transformer for Image Captioning [17.533503295862808]
We introduce a semi-autoregressive model for image captioning (dubbed SATIC).
It keeps the autoregressive property globally but generates words in parallel locally.
Experiments on the MSCOCO image captioning benchmark show that SATIC can achieve a better trade-off without bells and whistles.
arXiv Detail & Related papers (2021-06-17T12:36:33Z) - Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up the inference time by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL)
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2021-01-24T12:16:45Z) - Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length level embedding to endow them with this ability.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
arXiv Detail & Related papers (2020-07-19T03:40:51Z) - Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL)
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.