MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and
Unpaired Text-based Image Captioning
- URL: http://arxiv.org/abs/2112.06558v1
- Date: Mon, 13 Dec 2021 11:00:49 GMT
- Title: MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and
Unpaired Text-based Image Captioning
- Authors: Wenqiao Zhang, Haochen Shi, Jiannan Guo, Shengyu Zhang, Qingpeng Cai,
Juncheng Li, Sihui Luo, Yueting Zhuang
- Abstract summary: A text-based image intuitively contains abundant and complex multimodal relational content.
We propose the Multimodal relAtional Graph adversarIal inferenCe framework for diverse and unpaired TextCap.
We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image.
- Score: 46.4308182215488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based image captioning (TextCap) requires simultaneous comprehension of
visual content and reading the text of images to generate a natural language
description. Although this task can teach machines to understand the complex
human environment further, given that text is omnipresent in our daily
surroundings, it poses additional challenges beyond normal captioning. A
text-based image intuitively contains abundant and complex multimodal
relational content; that is, image details can be described from multiple
views rather than with a single caption. Although additional paired training
data could be introduced to capture the diversity of image descriptions,
annotating TextCap pairs with extra texts is labor-intensive and
time-consuming. Based on this insight, we investigate how to generate diverse
captions that
focus on different image parts using an unpaired training paradigm. We propose
the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework for
diverse and unpaired TextCap. This framework can adaptively construct multiple
multimodal relational graphs of images and model complex relationships among
graphs to represent descriptive diversity. Moreover, a cascaded generative
adversarial network is then developed from the modeled graphs to infer
unpaired caption generation at the image-sentence feature alignment and
linguistic coherence levels. We validate the effectiveness of MAGIC in
generating diverse captions
from different relational information items of an image. Experimental results
show that MAGIC can generate very promising outcomes without using any
image-caption training pairs.
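The abstract gives only a high-level description of the architecture. Purely as a hedged, illustrative sketch (not the authors' released code), the snippet below shows the two ideas in miniature: a relational graph encoder over visual-object and OCR-token nodes, and an adversarial loss that aligns unpaired image-graph features with sentence features. All module names, dimensions, and the single alignment level are assumptions; the cascaded discriminator that also enforces linguistic coherence is omitted.

```python
# Hedged sketch, not the authors' code: a toy relational graph encoder over
# visual-object and OCR-token nodes, plus an adversarial aligner that pushes
# unpaired image-graph features toward the sentence-feature distribution.
import torch
import torch.nn as nn

class RelationalGraphEncoder(nn.Module):
    """One message-passing step over a multimodal relational graph."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (N, dim) object/OCR node features; adj: (N, N) relation mask
        agg = adj @ self.msg(nodes) / adj.sum(-1, keepdim=True).clamp(min=1)
        return self.upd(agg, nodes).mean(dim=0)   # graph-level feature

class AlignmentDiscriminator(nn.Module):
    """Guesses whether a feature comes from a real sentence or an image graph."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

encoder, disc = RelationalGraphEncoder(), AlignmentDiscriminator()
bce = nn.BCEWithLogitsLoss()

nodes = torch.randn(5, 256)       # toy graph: e.g. 3 objects + 2 OCR tokens
adj = torch.ones(5, 5)            # toy fully-connected relations
sent_feat = torch.randn(1, 256)   # feature of an *unpaired* caption
graph_feat = encoder(nodes, adj).unsqueeze(0)

# Feature-alignment level of the adversarial inference:
d_loss = bce(disc(sent_feat), torch.ones(1, 1)) + \
         bce(disc(graph_feat.detach()), torch.zeros(1, 1))
g_loss = bce(disc(graph_feat), torch.ones(1, 1))   # encoder tries to fool disc
```

In the paper, descriptive diversity comes from constructing several such graphs per image, each covering a different subset of relational content; the sketch above encodes only one.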
Related papers
- Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention using a multi-modal encoder.
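As a rough, hypothetical illustration of the word-conditional attention and soft-masking idea summarized above (not the paper's actual implementation; the shapes and the masking rule are assumed):

```python
# Hedged illustration only: word-conditional attention over image regions,
# then soft-masking the most relevant regions to form a harder ITM view.
import torch
import torch.nn.functional as F

regions = torch.randn(36, 512)   # assumed: 36 region features
word = torch.randn(512)          # assumed: one word embedding from the caption

attn = F.softmax(regions @ word / 512 ** 0.5, dim=0)   # word-conditional attention (36,)
soft_mask = 1.0 - attn / attn.max()                    # strongest regions masked most
masked_regions = regions * soft_mask.unsqueeze(-1)     # diverse feature for ITM
```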
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
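Purely as an illustrative sketch of graph convolution over salient-object and scene-text nodes as summarized above (node construction, feature sizes, and the single layer are assumptions, not the paper's architecture):

```python
# Hedged sketch: one graph-convolution layer over a graph whose nodes are
# salient-object features and scene-text (OCR) features projected into a
# shared semantic space. Illustrative only.
import torch
import torch.nn as nn

class MultiModalGCNLayer(nn.Module):
    def __init__(self, dim=300):
        super().__init__()
        self.proj_vis = nn.Linear(2048, dim)   # assumed visual feature size
        self.proj_txt = nn.Linear(300, dim)    # assumed word-embedding size
        self.gcn = nn.Linear(dim, dim)

    def forward(self, vis_feats, txt_feats, adj):
        # Project both modalities into one space, then propagate over the graph.
        nodes = torch.cat([self.proj_vis(vis_feats), self.proj_txt(txt_feats)], dim=0)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.gcn(adj @ nodes / deg))   # relationship-enhanced features

layer = MultiModalGCNLayer()
vis = torch.randn(10, 2048)   # 10 detected objects
txt = torch.randn(4, 300)     # 4 OCR tokens
adj = torch.ones(14, 14)      # fully connected toy graph
enhanced = layer(vis, txt, adj)   # (14, 300)
```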
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)