Enhanced Modality Transition for Image Captioning
- URL: http://arxiv.org/abs/2102.11526v1
- Date: Tue, 23 Feb 2021 07:20:12 GMT
- Title: Enhanced Modality Transition for Image Captioning
- Authors: Ziwei Wang, Yadan Luo and Zi Huang
- Abstract summary: We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset, demonstrating the effectiveness of the proposed framework.
- Score: 51.72997126838352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning is a cross-modality knowledge discovery task that aims to
automatically describe an image with an informative and coherent sentence. To generate
the captions, previous encoder-decoder frameworks forward the visual vectors directly to
the recurrent language model, forcing the recurrent units to generate a sentence from the
visual features alone. Although these sentences are generally readable, they still lack
detail and highlights because the substantial gap between the image and text modalities
is not sufficiently addressed. In this work, we explicitly build a Modality Transition
Module (MTM) to transfer visual features into semantic representations before forwarding
them to the language model. During the training phase, the modality transition network is
optimised by the proposed modality loss, which compares the generated preliminary textual
encodings with the target sentence vectors from a pre-trained text auto-encoder. In this
way, the visual vectors are transferred into the textual subspace for more contextual and
precise language generation. The novel MTM can be incorporated into most existing
methods. Extensive experiments conducted on the MS-COCO dataset demonstrate the
effectiveness of the proposed framework, improving performance by 3.4% compared to the
state of the art.
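To make the mechanism above concrete, here is a minimal PyTorch sketch of how a modality transition module and its modality loss could be wired up. The two-layer MLP, the feature dimensions, and the cosine-distance loss are illustrative assumptions rather than the paper's exact design; the pre-trained text auto-encoder that supplies the target sentence vectors is assumed to be given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransitionModule(nn.Module):
    """Sketch of an MTM: maps visual features into the textual subspace."""
    def __init__(self, visual_dim=2048, text_dim=512):
        super().__init__()
        # A simple MLP stands in for the modality transition network;
        # the architecture in the paper may differ.
        self.transition = nn.Sequential(
            nn.Linear(visual_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, visual_dim) pooled image features
        return self.transition(visual_feats)  # preliminary textual encodings

def modality_loss(pred_text_enc, target_sentence_vec):
    """Compare preliminary textual encodings with target sentence vectors
    from a pre-trained text auto-encoder (assumed available)."""
    # Illustrative choice: mean cosine distance; the paper may use a
    # different measure between the two representations.
    return (1.0 - F.cosine_similarity(pred_text_enc, target_sentence_vec, dim=-1)).mean()

# Usage sketch: the modality loss would be added to the usual captioning loss.
mtm = ModalityTransitionModule()
visual_feats = torch.randn(8, 2048)   # stand-in CNN image features
target_vecs = torch.randn(8, 512)     # stand-in text auto-encoder sentence vectors
semantic_feats = mtm(visual_feats)    # would be fed to the language model
loss_mtm = modality_loss(semantic_feats, target_vecs)
```

In training, such an auxiliary loss would be combined with the standard word-level cross-entropy loss of the caption decoder.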
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities (a minimal retrieval sketch appears after this list).
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Masked Visual Reconstruction in Language Semantic Space [38.43966132249977]
The Masked visual Reconstruction In Language semantic Space (RILS) pre-training framework is presented.
RILS transforms vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets.
Our method exhibits advanced transferability on downstream classification, detection, and segmentation.
arXiv Detail & Related papers (2023-01-17T15:32:59Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose MaskOCR, a novel approach to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
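As referenced in the retrieval-augmented captioning entry above, a kNN memory keyed by visual similarity can be sketched as follows. The cosine-similarity lookup, the feature dimension, and the stand-in memory are assumptions for illustration only and do not reproduce that paper's exact retriever.

```python
import torch
import torch.nn.functional as F

def retrieve_neighbor_captions(query_feat, memory_feats, memory_captions, k=3):
    """Hypothetical kNN lookup over an external caption memory, keyed by
    visual similarity. memory_feats: (N, D); query_feat: (D,)."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), memory_feats, dim=-1)  # (N,)
    topk = sims.topk(k).indices.tolist()
    # The retrieved captions would then condition the caption generator.
    return [memory_captions[i] for i in topk]

# Usage sketch with stand-in data.
memory_feats = F.normalize(torch.randn(1000, 512), dim=-1)  # features of memory images
memory_captions = [f"caption {i}" for i in range(1000)]     # their stored captions
query_feat = F.normalize(torch.randn(512), dim=-1)          # features of the input image
neighbors = retrieve_neighbor_captions(query_feat, memory_feats, memory_captions)
```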