Efficient Modeling of Future Context for Image Captioning
- URL: http://arxiv.org/abs/2207.10897v1
- Date: Fri, 22 Jul 2022 06:21:43 GMT
- Title: Efficient Modeling of Future Context for Image Captioning
- Authors: Zhengcong Fei, Junshi Huang, Xiaoming Wei, Xiaolin Wei
- Abstract summary: Non-Autoregressive Image Captioning (NAIC) can leverage two-side relations with a modified mask operation.
Our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations.
- Score: 38.52032153180971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing approaches to image captioning usually generate the sentence
word by word from left to right, conditioned only on local context: the given
image and the previously generated words. Many studies have tried to make use
of global information during decoding, e.g., through iterative refinement, but
how to incorporate future context effectively and efficiently remains
under-explored. To address this issue, and inspired by the fact that
Non-Autoregressive Image Captioning (NAIC) can leverage two-side relations
through a modified mask operation, we aim to graft this advantage onto the
conventional Autoregressive Image Captioning (AIC) model while maintaining
inference efficiency without extra time cost. Specifically, the AIC and NAIC
models are first trained jointly with a shared visual encoder, forcing the
visual encoder to contain sufficient and valid future context; the AIC model is
then encouraged to capture the causal dynamics of cross-layer interchanging
from the NAIC model on its unconfident words, following a teacher-student
paradigm optimized with a distribution calibration training objective.
Empirical evidence demonstrates that our proposed approach clearly surpasses
state-of-the-art baselines in both automatic metrics and human evaluations on
the MS COCO benchmark. The source code is available at:
https://github.com/feizc/Future-Caption.
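To make the mechanism concrete, below is a minimal, hypothetical PyTorch-style sketch (not the authors' released code) of the two ingredients the abstract describes: the decoder self-attention mask that distinguishes AIC (causal, left-to-right) from NAIC (two-side, bidirectional), and a teacher-student objective that distills the future-aware NAIC distribution into the AIC model only on its unconfident words. The function names, confidence threshold, and exact loss form are illustrative assumptions, not details taken from the paper.

```python
# Sketch of (1) AIC vs. NAIC attention masks and (2) a confidence-gated
# distillation loss. All names and thresholds here are illustrative.
import torch
import torch.nn.functional as F


def causal_mask(seq_len: int) -> torch.Tensor:
    """AIC decoder mask: position t may attend only to positions <= t."""
    return torch.tril(torch.ones(seq_len, seq_len)).bool()


def bidirectional_mask(seq_len: int) -> torch.Tensor:
    """NAIC decoder mask: every position sees the full (two-side) context."""
    return torch.ones(seq_len, seq_len).bool()


def calibration_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     targets: torch.Tensor,
                     conf_threshold: float = 0.5) -> torch.Tensor:
    """Cross-entropy plus a KL term applied only where the autoregressive
    student is unconfident about the ground-truth token.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    targets: (batch, seq_len) token ids
    """
    # Standard per-token cross-entropy for the AIC student.
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets,
                         reduction="none")                      # (batch, seq)

    # The student's probability of the gold token marks "unconfident" positions.
    student_probs = student_logits.softmax(dim=-1)
    gold_prob = student_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    unconfident = (gold_prob < conf_threshold).float()          # (batch, seq)

    # Distill the future-aware NAIC teacher distribution on those positions only.
    kl = F.kl_div(student_logits.log_softmax(dim=-1),
                  teacher_logits.softmax(dim=-1).detach(),
                  reduction="none").sum(-1)                     # (batch, seq)

    return (ce + unconfident * kl).mean()
```

In this sketch the student keeps its ordinary cross-entropy objective and only its training signal changes, so standard left-to-right decoding at inference time is untouched, which matches the paper's claim of no extra inference cost.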
Related papers
- Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization [44.008094698200026]
We propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO)
Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation.
DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods.
arXiv Detail & Related papers (2024-08-26T18:00:33Z)
- Exploring Stochastic Autoregressive Image Modeling for Visual Representation [24.582376834198403]
We propose a novel stochastic autoregressive image modeling method (named SAIM) built on two simple designs.
By introducing stochastic prediction and a parallel encoder-decoder, SAIM significantly improves the performance of autoregressive image modeling.
Our method achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data.
arXiv Detail & Related papers (2022-12-03T13:04:29Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to single-stream methods while being 10,800x faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z)
- Cross Modification Attention Based Deliberation Model for Image Captioning [11.897899189552318]
We propose a universal two-pass decoding framework for image captioning.
A single-pass decoding based model first generates a draft caption according to an input image.
A Deliberation Model then performs the polishing process to refine the draft caption to a better image description.
arXiv Detail & Related papers (2021-09-17T08:38:08Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models, while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)