Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
- URL: http://arxiv.org/abs/2105.13994v1
- Date: Fri, 28 May 2021 17:20:27 GMT
- Title: Linguistic Structures as Weak Supervision for Visual Scene Graph Generation
- Authors: Keren Ye and Adriana Kovashka
- Abstract summary: We show how linguistic structures in captions can benefit scene graph generation.
Our method captures the information provided in captions about relations between individual triplets, and context for subjects and objects.
Given the large and diverse sources of multimodal data on the web, linguistic supervision is more scalable than crowdsourced triplets.
- Score: 39.918783911894245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work in scene graph generation requires categorical supervision at the
level of triplets - subjects and objects, and predicates that relate them,
either with or without bounding box information. However, scene graph
generation is a holistic task: thus holistic, contextual supervision should
intuitively improve performance. In this work, we explore how linguistic
structures in captions can benefit scene graph generation. Our method captures
the information provided in captions about relations between individual
triplets, and context for subjects and objects (e.g. visual properties are
mentioned). Captions are a weaker type of supervision than triplets since the
alignment between the exhaustive list of human-annotated subjects and objects
in triplets, and the nouns in captions, is weak. However, given the large and
diverse sources of multimodal data on the web (e.g. blog posts with images and
captions), linguistic supervision is more scalable than crowdsourced triplets.
We show extensive experimental comparisons against prior methods which leverage
instance- and image-level supervision, and ablate our method to show the impact
of leveraging phrasal and sequential context, and techniques to improve
localization of subjects and objects.
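To make the idea of linguistic supervision concrete, here is a minimal sketch (not the authors' pipeline) of how caption structure can yield candidate triplets: it runs spaCy's dependency parser over a caption and reads subject-verb-object and subject-verb-preposition-object patterns off the parse. The parser, the heuristics, and the example caption are assumptions for illustration only; the paper's method additionally exploits phrasal and sequential context beyond such patterns.

```python
# A minimal sketch, not the authors' pipeline: extract candidate
# (subject, predicate, object) triplets from a caption using spaCy's
# dependency parse, as a weak substitute for annotated triplets.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def caption_to_triplets(caption: str):
    """Return rough (subject, predicate, object) tuples found in a caption."""
    doc = nlp(caption)
    triplets = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        # Direct objects: "a man riding a horse" -> (man, ride, horse)
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
        # Prepositional objects: "a dog sitting on a bench" -> (dog, sit on, bench)
        for prep in (c for c in token.children if c.dep_ == "prep"):
            for pobj in (c for c in prep.children if c.dep_ == "pobj"):
                for subj in subjects:
                    triplets.append((subj.lemma_, f"{token.lemma_} {prep.text}", pobj.lemma_))
        for subj in subjects:
            for obj in objects:
                triplets.append((subj.lemma_, token.lemma_, obj.lemma_))
    return triplets

print(caption_to_triplets("A man is riding a horse on the beach."))
# e.g. [('man', 'ride on', 'beach'), ('man', 'ride', 'horse')]; exact output
# depends on the parser's attachment decisions.
```

Triplets extracted this way are only weakly aligned to image regions (a caption rarely mentions every annotated object), which is exactly the weak-alignment gap the abstract describes.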
Related papers
- GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives [69.36723767339001]
We propose a novel framework named GPT4SGG to obtain more accurate and comprehensive scene graph signals.
We show that GPT4SGG significantly improves the performance of SGG models trained on image-caption data.
arXiv Detail & Related papers (2023-12-07T14:11:00Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting: generating a realistic image from given objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
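Several of the entries above, like the main paper's localization techniques, rely on weakly supervised multi-instance learning (MIL) to ground caption-derived labels in region proposals. The fragment below is a generic MIL pooling sketch under assumed tensor shapes and label vocabulary, not code from any of the cited papers: each caption-mentioned label is treated as a bag label over an image's regions, and the image-level score is the maximum over per-region scores.

```python
# A generic multi-instance learning (MIL) scoring sketch. Feature size,
# vocabulary size, and region counts are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILLabelScorer(nn.Module):
    def __init__(self, feat_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_labels)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # Per-region label scores: (num_regions, num_labels)
        region_scores = self.classifier(region_feats)
        # MIL pooling: an image carries a label if at least one region does,
        # so take the max over regions -> (num_labels,)
        image_scores, _ = region_scores.max(dim=0)
        return image_scores

# Training signal: binary cross-entropy against labels mentioned in the caption.
scorer = MILLabelScorer(feat_dim=2048, num_labels=1000)
regions = torch.randn(36, 2048)        # e.g. 36 detector proposals
caption_labels = torch.zeros(1000)
caption_labels[[12, 87]] = 1.0         # hypothetical ids for "man", "horse"
loss = F.binary_cross_entropy_with_logits(scorer(regions), caption_labels)
```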
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.