TCIC: Theme Concepts Learning Cross Language and Vision for Image
Captioning
- URL: http://arxiv.org/abs/2106.10936v1
- Date: Mon, 21 Jun 2021 09:12:55 GMT
- Title: TCIC: Theme Concepts Learning Cross Language and Vision for Image
Captioning
- Authors: Zhihao Fan, Zhongyu Wei, Siyuan Wang, Ruize Wang, Zejun Li, Haijun
Shan, Xuanjing Huang
- Abstract summary: We propose a Theme Concepts extended Image Captioning framework that incorporates theme concepts to represent high-level cross-modality semantics.
Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on the proposed Transformer with Theme Nodes (TTN).
- Score: 50.30918954390918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing research for image captioning usually represents an image using a
scene graph with low-level facts (objects and relations) and fails to capture
the high-level semantics. In this paper, we propose a Theme Concepts extended
Image Captioning (TCIC) framework that incorporates theme concepts to represent
high-level cross-modality semantics. In practice, we model theme concepts as
memory vectors and propose Transformer with Theme Nodes (TTN) to incorporate
those vectors for image captioning. Considering that theme concepts can be
learned from both images and captions, we propose two settings for their
representation learning based on TTN. On the vision side, TTN is configured to
take both scene graph based features and theme concepts as input for visual
representation learning. On the language side, TTN is configured to take both
captions and theme concepts as input for text representation reconstruction.
Both settings aim to generate target captions with the same transformer-based
decoder. During training, we further align representations of theme concepts
learned from images and corresponding captions to enforce cross-modality
learning. Experimental results on MS COCO show the effectiveness
of our approach compared to some state-of-the-art models.
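
The following is a minimal PyTorch-style sketch, based only on the abstract above, of how theme concepts could be attached to a transformer encoder as extra "theme nodes" and aligned across modalities during training. Module names, dimensions, and the cosine-based alignment term are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): theme concepts as learnable memory
# vectors ("theme nodes") prepended to the encoder input on both the vision
# side (scene-graph features) and the language side (caption embeddings),
# with an alignment loss pulling the two sets of theme representations together.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ThemeNodeEncoder(nn.Module):
    """Transformer encoder with learnable theme-concept memory vectors."""

    def __init__(self, d_model=512, num_themes=20, num_layers=3, nhead=8):
        super().__init__()
        # Theme concepts modeled as memory vectors ("theme nodes").
        self.theme_memory = nn.Parameter(torch.randn(num_themes, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, features):
        # features: (B, N, d_model) -- scene-graph node features on the vision
        # side, or caption token embeddings on the language side.
        b = features.size(0)
        themes = self.theme_memory.unsqueeze(0).expand(b, -1, -1)
        x = self.encoder(torch.cat([themes, features], dim=1))  # prepend theme nodes
        k = self.theme_memory.size(0)
        return x[:, :k], x[:, k:]   # theme representations, feature representations


def theme_alignment_loss(vision_themes, language_themes):
    # Pull theme representations learned from an image and from its caption
    # toward each other (a simple cosine stand-in for the alignment objective).
    v = F.normalize(vision_themes, dim=-1)
    t = F.normalize(language_themes, dim=-1)
    return (1.0 - (v * t).sum(dim=-1)).mean()


if __name__ == "__main__":
    vision_ttn, language_ttn = ThemeNodeEncoder(), ThemeNodeEncoder()
    sg_feats = torch.randn(2, 36, 512)   # dummy scene-graph features
    cap_embs = torch.randn(2, 17, 512)   # dummy caption token embeddings
    v_themes, _ = vision_ttn(sg_feats)
    t_themes, _ = language_ttn(cap_embs)
    print(theme_alignment_loss(v_themes, t_themes).item())
```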
Related papers
- Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs)
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
- SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment [11.556516260190737]
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research.
The paper builds on Contrastive Captioners (CoCa), which integrate Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework; SyCoCa symmetrizes this design with attentive masking (a generic sketch of the combined objective follows below).
arXiv Detail & Related papers (2024-01-04T08:42:36Z)
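
For context on the "unified framework" the CoCa line of work refers to, the following is a generic, hedged sketch of combining a CLIP-style image-text contrastive loss with a captioning cross-entropy loss; the pooled embeddings, temperature, and loss weights are placeholder assumptions, and this is not the CoCa or SyCoCa implementation.

```python
# Generic sketch (not CoCa/SyCoCa code): a training objective that adds a
# CLIP-style image-text contrastive loss to a standard captioning
# cross-entropy loss -- the basic "unified framework" idea.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) pooled embeddings of matched image/text pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric InfoNCE: match images to texts and texts to images.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def captioning_loss(dec_logits, tokens, pad_id=0):
    # dec_logits: (B, T, V) decoder outputs; tokens: (B, T) reference caption.
    return F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                           tokens.reshape(-1), ignore_index=pad_id)


def unified_loss(img_emb, txt_emb, dec_logits, captions, w_con=1.0, w_cap=1.0):
    # Weighted sum of the two objectives; weights are arbitrary placeholders.
    return (w_con * contrastive_loss(img_emb, txt_emb) +
            w_cap * captioning_loss(dec_logits, captions))


if __name__ == "__main__":
    B, D, T, V = 4, 256, 12, 1000
    loss = unified_loss(torch.randn(B, D), torch.randn(B, D),
                        torch.randn(B, T, V), torch.randint(1, V, (B, T)))
    print(loss.item())
```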
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into end-to-end captioning.
In particular, the CTN is built on a vision transformer and predicts the concept tokens through a classification task (a hedged sketch of such a concept head follows below).
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
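
The concept-prediction step described for the CTN can be illustrated with a small, hedged sketch of a multi-label concept classifier over vision-transformer grid features; the mean pooling, concept vocabulary size, and top-k selection are illustrative assumptions, not the ViTCAP code.

```python
# Hedged sketch (not the ViTCAP/CTN code): predict a bag of semantic concepts
# from ViT grid features with a multi-label classification head, then embed
# the top-scoring concepts so a captioning decoder could attend to them.
import torch
import torch.nn as nn


class ConceptHead(nn.Module):
    def __init__(self, d_model=768, num_concepts=1000, top_k=10):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_concepts)
        self.concept_embed = nn.Embedding(num_concepts, d_model)
        self.top_k = top_k

    def forward(self, grid_feats):
        # grid_feats: (B, N, d_model) patch features from a vision transformer.
        pooled = grid_feats.mean(dim=1)                 # simple mean pooling
        logits = self.classifier(pooled)                # (B, num_concepts)
        top_ids = logits.topk(self.top_k, dim=-1).indices
        concept_tokens = self.concept_embed(top_ids)    # (B, top_k, d_model)
        return logits, concept_tokens


def concept_loss(logits, concept_labels):
    # concept_labels: (B, num_concepts) multi-hot targets, e.g. concepts
    # extracted from the reference captions.
    return nn.functional.binary_cross_entropy_with_logits(logits, concept_labels)


if __name__ == "__main__":
    head = ConceptHead()
    logits, tokens = head(torch.randn(2, 196, 768))
    labels = torch.zeros(2, 1000)
    labels[:, :5] = 1.0                                 # fake ground-truth concepts
    print(concept_loss(logits, labels).item(), tokens.shape)
```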
- RefineCap: Concept-Aware Refinement for Image Captioning [34.35093893441625]
We propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics.
Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
arXiv Detail & Related papers (2021-09-08T10:12:14Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.