Visual Semantic Parsing: From Images to Abstract Meaning Representation
- URL: http://arxiv.org/abs/2210.14862v2
- Date: Thu, 27 Oct 2022 15:54:50 GMT
- Title: Visual Semantic Parsing: From Images to Abstract Meaning Representation
- Authors: Mohamed Ashraf Abdelsalam, Zhan Shi, Federico Fancellu, Kalliopi
Basioti, Dhaivat J. Bhatt, Vladimir Pavlovic and Afsaneh Fazly
- Abstract summary: We propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR).
Our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input.
Our findings point to important future research directions for improved scene understanding.
- Score: 20.60579156219413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of scene graphs for visual scene understanding has brought
attention to the benefits of abstracting a visual input (e.g., image) into a
structured representation, where entities (people and objects) are nodes
connected by edges specifying their relations. Building these representations,
however, requires expensive manual annotation in the form of images paired with
their scene graphs or frames. These formalisms remain limited in the nature of
entities and relations they can capture. In this paper, we propose to leverage
a widely-used meaning representation in the field of natural language
processing, the Abstract Meaning Representation (AMR), to address these
shortcomings. Compared to scene graphs, which largely emphasize spatial
relationships, our visual AMR graphs are more linguistically informed, with a
focus on higher-level semantic concepts extrapolated from visual input.
Moreover, they allow us to generate meta-AMR graphs to unify information
contained in multiple image descriptions under one representation. Through
extensive experimentation and analysis, we demonstrate that we can re-purpose
an existing text-to-AMR parser to parse images into AMRs. Our findings point to
important future research directions for improved scene understanding.
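To make the target formalism concrete: AMR graphs are conventionally written in PENMAN notation, with concepts (often PropBank frames such as throw-01) as nodes and semantic roles as edges. Below is a minimal sketch using the third-party penman Python library; the caption and its AMR are illustrative examples, not drawn from the paper.

```python
# Minimal sketch of inspecting an AMR graph in PENMAN notation.
# Assumes the third-party `penman` library (pip install penman).
import penman

# AMR for the hypothetical caption "A boy throws a ball in the park".
amr = """
(t / throw-01
   :ARG0 (b / boy)
   :ARG1 (b2 / ball)
   :location (p / park))
"""

graph = penman.decode(amr)

# Nodes: each variable is an instance of a concept (often a PropBank frame).
for var, _, concept in graph.instances():
    print(var, "->", concept)   # e.g. t -> throw-01

# Edges: semantic roles linking the nodes.
for src, role, tgt in graph.edges():
    print(src, role, tgt)       # e.g. t :ARG0 b
```

Graphs of this kind, obtained by running a text-to-AMR parser over image captions, are what the paper proposes to repurpose as targets for parsing images directly into AMRs.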
Related papers
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- SGRAM: Improving Scene Graph Parsing via Abstract Meaning Representation [24.93559076181481]
A scene graph is a structured semantic representation that can be modeled as a graph derived from images and texts.
In this paper, we focus on the problem of scene graph parsing from textual description of a visual scene.
We design a simple yet effective two-stage scene graph parsing framework utilizing abstract meaning representation.
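As a hypothetical illustration of why AMR is a convenient intermediate for scene graph parsing (this sketch is not SGRAM's actual two-stage framework), scene-graph-style triples can be read off an AMR parse by mapping variables back to concepts and collapsing predicate nodes:

```python
# Schematic only: derive (subject, relation, object) triples from an AMR.
# Assumes the third-party `penman` library; the AMR is an invented example.
import penman

amr = "(t / throw-01 :ARG0 (b / boy) :ARG1 (b2 / ball))"
graph = penman.decode(amr)

# Map each variable back to its concept label.
concept = {var: c for var, _, c in graph.instances()}

# Group each predicate node's outgoing role edges.
args = {}
for src, role, tgt in graph.edges():
    args.setdefault(src, {})[role] = tgt

# Collapse predicate nodes carrying :ARG0/:ARG1 into scene-graph triples.
scene_triples = [
    (concept[a[':ARG0']], concept[pred], concept[a[':ARG1']])
    for pred, a in args.items()
    if ':ARG0' in a and ':ARG1' in a
]
print(scene_triples)  # [('boy', 'throw-01', 'ball')]
```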
arXiv Detail & Related papers (2022-10-17T00:37:00Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
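The gated aggregation can be sketched generically (an assumed formulation for illustration, not the authors' exact architecture): each region node sums transformed neighbor features, scaled by a sigmoid gate computed from the node pair.

```python
# Schematic gated graph-convolution step over region features (illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                          # number of region nodes, feature dim
H = rng.normal(size=(n, d))          # node (region) features
A = np.array([[0, 1, 1, 0],          # adjacency of the semantic graph
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
W = rng.normal(size=(d, d))          # message transform (learned in practice)
Wg = rng.normal(size=(2 * d, 1))     # gate parameters (learned in practice)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H_new = H.copy()
for i in range(n):
    for j in np.flatnonzero(A[i]):
        # Scalar gate in (0, 1) computed from the concatenated node pair.
        gate = sigmoid(np.concatenate([H[i], H[j]]) @ Wg)
        H_new[i] += gate * (H[j] @ W)   # gated neighbor message
```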
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Graph-Structured Referring Expression Reasoning in The Wild [105.95488002374158]
Grounding referring expressions aims to locate in an image an object referred to by a natural language expression.
We propose a scene graph guided modular network (SGMN) to perform reasoning over a semantic graph and a scene graph.
We also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning.
arXiv Detail & Related papers (2020-04-19T11:00:30Z)