The Role of Syntactic Planning in Compositional Image Captioning
- URL: http://arxiv.org/abs/2101.11911v1
- Date: Thu, 28 Jan 2021 10:26:08 GMT
- Title: The Role of Syntactic Planning in Compositional Image Captioning
- Authors: Emanuele Bugliarello, Desmond Elliott
- Abstract summary: In this work, we investigate methods to improve compositional generalization by planning the syntactic structure of a caption.
Our experiments show that jointly modeling tokens and syntactic tags enhances generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics.
- Score: 17.363891408746298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning has focused on generalizing to images drawn from the same
distribution as the training set, and not to the more challenging problem of
generalizing to different distributions of images. Recently, Nikolaus et al.
(2019) introduced a dataset to assess compositional generalization in image
captioning, where models are evaluated on their ability to describe images with
unseen adjective-noun and noun-verb compositions. In this work, we investigate
different methods to improve compositional generalization by planning the
syntactic structure of a caption. Our experiments show that jointly modeling
tokens and syntactic tags enhances generalization in both RNN- and
Transformer-based models, while also improving performance on standard metrics.
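As a rough illustration of what "jointly modeling tokens and syntactic tags" could look like at the data level, the sketch below interleaves each caption token with a syntactic tag so that a single decoder would predict the tag before the word it plans. This is only one possible scheme and an assumption on our part: the abstract does not specify the tag set, the ordering, or the model. The function names, the `<TAG>` markup, and the hand-written tags are all hypothetical.

```python
from typing import List


def interleave_tags_and_tokens(tokens: List[str], tags: List[str]) -> List[str]:
    """Build one target sequence in which every token is preceded by its tag.

    Hypothetical helper: the tag-before-token ordering is an assumption made
    for illustration, not a detail taken from the abstract.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    target: List[str] = []
    for tag, token in zip(tags, tokens):
        target.append(f"<{tag}>")  # syntactic plan for the next word
        target.append(token)       # the word itself
    return target


def strip_tags(sequence: List[str]) -> List[str]:
    """Recover the plain caption by keeping every second element (the tokens)."""
    return sequence[1::2]


# Hand-written example caption and tags (illustrative only).
tokens = ["a", "white", "dog", "runs"]
tags = ["DET", "ADJ", "NOUN", "VERB"]

target = interleave_tags_and_tokens(tokens, tags)
caption = strip_tags(target)
```

At evaluation time only the recovered caption would be scored, so the tags act as intermediate planning symbols rather than as part of the output.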
Related papers
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models [64.43764443000003]
We identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts.
We propose a new metric for compositionality without such linguistic priors.
arXiv Detail & Related papers (2023-10-04T12:48:33Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Improving Compositional Generalization in Classification Tasks via Structure Annotations [33.90268697120572]
Humans have a great ability to generalize compositionally, but state-of-the-art neural models struggle to do so.
First, we study ways to convert a natural language sequence-to-sequence dataset to a classification dataset that also requires compositional generalization.
Second, we show that providing structural hints (specifically, providing parse trees and entity links as attention masks for a Transformer model) helps compositional generalization.
arXiv Detail & Related papers (2021-06-19T06:07:27Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Compositional Generalization via Semantic Tagging [81.24269148865555]
We propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models.
We show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
arXiv Detail & Related papers (2020-10-22T15:55:15Z)
- Image Captioning with Compositional Neural Module Networks [18.27510863075184]
We introduce a hierarchical framework for image captioning that explores both compositionality and sequentiality of natural language.
Our algorithm learns to compose a detail-rich sentence by selectively attending to different modules corresponding to unique aspects of each object detected in an input image.
arXiv Detail & Related papers (2020-07-10T20:58:04Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.