Improving Image Captioning via Predicting Structured Concepts
- URL: http://arxiv.org/abs/2311.08223v2
- Date: Tue, 28 Nov 2023 04:05:03 GMT
- Title: Improving Image Captioning via Predicting Structured Concepts
- Authors: Ting Wang, Weidong Chen, Yuanhe Tian, Yan Song, Zhendong Mao
- Abstract summary: We propose a structured concept predictor to predict concepts and their structures, which we then integrate into captioning.
We design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies.
Our approach captures potential relations among concepts and discriminatively learns different concepts, thereby effectively facilitating image captioning with information inherited across modalities.
- Score: 46.88858655641866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Owing to the difficulty of bridging the semantic gap between images and texts in image captioning, conventional studies in this area have treated semantic concepts as a bridge between the two modalities and improved captioning performance accordingly. Although promising results on concept prediction were obtained, these studies normally ignore the relationships among concepts, which depend not only on objects in the image but also on word dependencies in the text, and thus offer considerable potential for improving the generation of good descriptions. In this paper, we propose a structured concept predictor (SCP) to predict concepts and their structures, and then integrate them into captioning, so as to enhance the contribution of visual signals in this task via concepts and further use their relations to distinguish cross-modal semantics for better description generation. In particular, we design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies, and then learn differentiated contributions from these concepts for the following decoding process. Our approach therefore captures potential relations among concepts and discriminatively learns different concepts, thereby effectively facilitating image captioning with information inherited across modalities. Extensive experiments demonstrate the effectiveness of our approach as well as of each proposed module.
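The abstract names two mechanisms: a W-GCN that propagates information among predicted concepts along word-dependency-driven relations, and a weighting of those concepts for the subsequent decoding. Below is a minimal PyTorch sketch of such a weighted graph convolution over concept embeddings; the bilinear edge scorer, hidden size, residual connection, and usage shapes are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a weighted graph convolutional layer over predicted
# concept embeddings. The way edge weights are computed here (a learned
# bilinear score over concept pairs) is an assumption for illustration;
# the paper derives concept relations from word dependencies.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.edge_scorer = nn.Bilinear(dim, dim, 1)  # hypothetical edge-weight predictor
        self.linear = nn.Linear(dim, dim)

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (batch, num_concepts, dim) embeddings of predicted concepts
        b, n, d = concepts.shape
        # Score every ordered pair of concepts to build a soft weighted adjacency.
        src = concepts.unsqueeze(2).expand(b, n, n, d)
        dst = concepts.unsqueeze(1).expand(b, n, n, d)
        adj = self.edge_scorer(src.reshape(-1, d), dst.reshape(-1, d)).view(b, n, n)
        adj = torch.softmax(adj, dim=-1)        # row-normalised edge weights
        # Weighted message passing, then a linear transform with a residual connection.
        messages = torch.bmm(adj, concepts)     # (batch, num_concepts, dim)
        return F.relu(self.linear(messages)) + concepts

# Usage: refine 10 predicted concept embeddings of size 512 for a batch of 2 images.
layer = WeightedGCNLayer(512)
refined = layer(torch.randn(2, 10, 512))
print(refined.shape)  # torch.Size([2, 10, 512])
```

In this sketch the refined concept features would then be attended to by the caption decoder; how the decoder weights the concepts is the part the paper learns discriminatively.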
Related papers
- CusConcept: Customized Visual Concept Decomposition with Diffusion Models [13.95568624067449]
We propose a two-stage framework, CusConcept, to extract customized visual concept embedding vectors.
In the first stage, CusConcept employs a vocabulary-guided concept decomposition mechanism.
In the second stage, joint concept refinement is performed to enhance the fidelity and quality of generated images.
arXiv Detail & Related papers (2024-10-01T04:41:44Z)
- Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning (a generic sketch of this kind of contrastive image-text objective appears after this list).
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z)
- Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks.
Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training.
This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
arXiv Detail & Related papers (2024-01-09T16:16:16Z)
- Cross-Modal Conceptualization in Bottleneck Models [21.2577097041883]
Concept Bottleneck Models (CBMs) assume that training examples (e.g., x-ray images) are annotated with high-level concepts.
In our approach, we adopt a more moderate assumption and instead use text descriptions, accompanying the images in training, to guide the induction of concepts.
Our cross-modal approach treats concepts as discrete latent variables and promotes concepts that (1) are predictive of the label, and (2) can be predicted reliably from both the image and text.
arXiv Detail & Related papers (2023-10-23T11:00:19Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
arXiv Detail & Related papers (2023-07-28T10:26:28Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
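Several of the papers above (e.g., CLIF's contrastive image-language fine-tuning and the contrastively trained vision-language models discussed in the compositionality work) align image and text embeddings with a contrastive objective. The snippet below is a minimal, generic sketch of a symmetric InfoNCE-style image-text contrastive loss; the embedding size, temperature, and random stand-in embeddings are illustrative assumptions and are not taken from any of the cited papers.

```python
# Generic symmetric image-text contrastive loss (CLIP-style InfoNCE).
# Encoders are omitted; dimensions and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim); matching pairs share the same row index.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```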