Cross Modification Attention Based Deliberation Model for Image
Captioning
- URL: http://arxiv.org/abs/2109.08411v1
- Date: Fri, 17 Sep 2021 08:38:08 GMT
- Title: Cross Modification Attention Based Deliberation Model for Image
Captioning
- Authors: Zheng Lian, Yanan Zhang, Haichang Li, Rui Wang, Xiaohui Hu
- Abstract summary: We propose a universal two-pass decoding framework for image captioning.
A single-pass decoding based model first generates a draft caption according to an input image.
A Deliberation Model then performs the polishing process to refine the draft caption into a better image description.
- Score: 11.897899189552318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conventional encoder-decoder framework for image captioning generally
adopts a single-pass decoding process, which predicts the target descriptive
sentence word by word in temporal order. Despite the great success of this
framework, it still suffers from two serious disadvantages. Firstly, it is
unable to correct mistakes in already predicted words, which may mislead
subsequent predictions and cause an error accumulation problem. Secondly, such
a framework can only leverage the already generated words but not the possible
future words, and thus lacks the ability to plan globally over linguistic
information. To overcome these limitations, we explore a universal two-pass
decoding framework, where a single-pass decoding based model serving as the
Drafting Model first generates a draft caption according to an input image, and
a Deliberation Model then performs the polishing process to refine the draft
caption into a better image description. Furthermore, inspired by the
complementarity between different modalities, we propose a novel Cross
Modification Attention (CMA) module to enhance the semantic expression of the
image features and filter out erroneous information from the draft captions. We
integrate CMA with the decoder of our Deliberation Model and name it the Cross
Modification Attention based Deliberation Model (CMA-DM). We train our proposed
framework by jointly optimizing all trainable components from scratch with a
trade-off coefficient. Experiments on the MS COCO dataset demonstrate that our
approach obtains significant improvements over single-pass decoding baselines
and achieves competitive performance compared with other state-of-the-art
two-pass decoding based methods.
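For concreteness, the sketch below illustrates the two-pass idea described in the abstract, assuming PyTorch. The CrossModificationAttention class, the sigmoid gating scheme, and the joint_loss helper are illustrative assumptions reconstructed from the abstract alone, not the authors' released implementation; the exact CMA formulation is not given here.

    import torch
    import torch.nn as nn

    class CrossModificationAttention(nn.Module):
        # Hypothetical CMA block: image features and draft-caption embeddings
        # attend to each other, and learned gates decide how strongly each
        # modality is "modified" by the other (enhancing image semantics and
        # filtering out errors carried over from the draft).
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.img_from_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.txt_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.img_gate = nn.Linear(2 * dim, dim)
            self.txt_gate = nn.Linear(2 * dim, dim)

        def forward(self, img_feats, draft_embeds):
            # Each modality queries the other (arguments: query, key, value).
            img_upd, _ = self.img_from_txt(img_feats, draft_embeds, draft_embeds)
            txt_upd, _ = self.txt_from_img(draft_embeds, img_feats, img_feats)
            # Sigmoid gates blend original and cross-modified features.
            g_i = torch.sigmoid(self.img_gate(torch.cat([img_feats, img_upd], dim=-1)))
            g_t = torch.sigmoid(self.txt_gate(torch.cat([draft_embeds, txt_upd], dim=-1)))
            img_out = g_i * img_upd + (1 - g_i) * img_feats
            txt_out = g_t * txt_upd + (1 - g_t) * draft_embeds
            return img_out, txt_out

    def joint_loss(draft_logits, refined_logits, target, lam, pad_id=0):
        # Both passes are optimized jointly from scratch with a trade-off
        # coefficient, i.e. L = L_draft + lam * L_deliberation (exact form
        # assumed; pad_id is an illustrative padding-token id).
        ce = nn.CrossEntropyLoss(ignore_index=pad_id)
        return (ce(draft_logits.flatten(0, 1), target.flatten())
                + lam * ce(refined_logits.flatten(0, 1), target.flatten()))

At inference time, the Drafting Model would first decode a draft caption from the image, the CMA block would cross-modify the image features and the embedded draft, and the Deliberation Model's decoder would then generate the polished caption from the modified features; this sketch fixes only that data flow, not the architectural details.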
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing texts in the source language into an image containing translations in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M), designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- Eliminating Contextual Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion [35.95513392917737]
A novel approach called Dual-Cycle Diffusion generates an unbiased mask to guide image editing.
Our experiments demonstrate the effectiveness of the proposed method, as it significantly improves the D-CLIP score from 0.272 to 0.283.
arXiv Detail & Related papers (2023-02-05T14:30:22Z)
- Efficient Modeling of Future Context for Image Captioning [38.52032153180971]
Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation.
Our proposed approach clearly surpasses the state-of-the-art baselines in both automatic metrics and human evaluations.
arXiv Detail & Related papers (2022-07-22T06:21:43Z)
- Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset, demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
arXiv Detail & Related papers (2020-10-28T21:55:25Z)
- Toward Interpretability of Dual-Encoder Models for Dialogue Response Suggestions [18.117115200484708]
We present an attentive dual encoder model that includes an attention mechanism on top of the extracted word-level features from two encoders.
We design a novel regularization loss to minimize the mutual information between unimportant words and desired labels.
Experiments demonstrate the effectiveness of the proposed model in terms of better Recall@1 accuracy and visualized interpretability.
arXiv Detail & Related papers (2020-03-02T21:26:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.