Graph-Structured Referring Expression Reasoning in The Wild
- URL: http://arxiv.org/abs/2004.08814v1
- Date: Sun, 19 Apr 2020 11:00:30 GMT
- Title: Graph-Structured Referring Expression Reasoning in The Wild
- Authors: Sibei Yang, Guanbin Li, Yizhou Yu
- Abstract summary: Grounding referring expressions aims to locate in an image an object referred to by a natural language expression.
We propose a scene graph guided modular network (SGMN) to perform reasoning over a semantic graph and a scene graph.
We also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning.
- Score: 105.95488002374158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding referring expressions aims to locate in an image an object referred
to by a natural language expression. The linguistic structure of a referring
expression provides a layout of reasoning over the visual contents, and it is
often crucial to align and jointly understand the image and the referring
expression. In this paper, we propose a scene graph guided modular network
(SGMN), which performs reasoning over a semantic graph and a scene graph with
neural modules under the guidance of the linguistic structure of the
expression. In particular, we model the image as a structured semantic graph,
and parse the expression into a language scene graph. The language scene graph
not only decodes the linguistic structure of the expression, but also has a
consistent representation with the image semantic graph. In addition to
exploring structured solutions to grounding referring expressions, we also
propose Ref-Reasoning, a large-scale real-world dataset for structured
referring expression reasoning. We automatically generate referring expressions
over the scene graphs of images using diverse expression templates and
functional programs. This dataset is equipped with real-world visual contents
as well as semantically rich expressions with different reasoning layouts.
Experimental results show that our SGMN not only significantly outperforms
existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also
surpasses state-of-the-art structured methods on commonly used benchmark
datasets. It can also provide interpretable visual evidence of reasoning. Data
and code are available at https://github.com/sibeiyang/sgmn.
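To make the reasoning layout concrete, below is a minimal sketch of the two-graph idea. It is not the authors' SGMN, which applies learned neural modules over visual features and word embeddings; here toy exact-match scores stand in for the unary and pairwise modules, and all names (Node, score_nodes, ground) are illustrative assumptions.

```python
# Illustrative sketch only: NOT the authors' SGMN. It mirrors the layout
# of bottom-up reasoning over an image semantic graph, guided by a
# language scene graph parsed from the referring expression.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                       # category, e.g. "man", "umbrella"
    edges: list = field(default_factory=list)  # (relation, node index) pairs


def score_nodes(lang_node, image_nodes):
    # toy unary module: 1.0 if category names match, else 0.0
    return [1.0 if lang_node.name == v.name else 0.0 for v in image_nodes]


def ground(lang_graph, image_graph, root=0):
    """Score image nodes against language node `root`, recursively
    grounding its children first and propagating their scores along
    matching relations (bottom-up structured reasoning)."""
    scores = score_nodes(lang_graph[root], image_graph)
    for rel, child in lang_graph[root].edges:
        child_scores = ground(lang_graph, image_graph, child)
        for i, v in enumerate(image_graph):
            # an image node keeps its score only if one of its
            # `rel`-neighbors matches the child language node
            support = max(
                (child_scores[j] for r, j in v.edges if r == rel),
                default=0.0,
            )
            scores[i] *= support
    return scores


# Expression: "the man holding an umbrella" -> language scene graph
lang = [Node("man", [("holding", 1)]), Node("umbrella")]
# Image semantic graph: two men, but only object 0 holds the umbrella
img = [
    Node("man", [("holding", 2)]),
    Node("man", [("next to", 0)]),
    Node("umbrella"),
]
print(ground(lang, img))  # [1.0, 0.0, 0.0] -> object 0 is the referent
```

In SGMN itself, the unary and pairwise scores come from neural modules attending over region features and the parsed expression; this sketch only mirrors the control flow that makes the reasoning interpretable.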
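The abstract also describes generating Ref-Reasoning expressions from image scene graphs with templates and functional programs. Here is a toy sketch of that idea; the triple format, templates, and function names are assumptions for illustration, and the paper's pipeline uses far richer templates and functional programs over real-image scene graphs.

```python
# Toy template-based expression generation over a scene graph, in the
# spirit of Ref-Reasoning's automatic generation (illustrative only).
import random

# scene graph as (subject, relation, object) triples with instance ids
TRIPLES = [
    ("man#1", "holding", "umbrella#3"),
    ("man#2", "next to", "man#1"),
]

TEMPLATES = [
    "the {subj} that is {rel} the {obj}",
    "the {subj} {rel} the {obj}",
]


def name(node_id):
    # strip the instance id, e.g. "man#1" -> "man"
    return node_id.split("#")[0]


def generate(triples, rng=random):
    subj, rel, obj = rng.choice(triples)
    expr = rng.choice(TEMPLATES).format(subj=name(subj), rel=rel, obj=name(obj))
    return expr, subj  # expression plus the ground-truth referent id


expr, referent = generate(TRIPLES)
print(expr, "->", referent)  # e.g. "the man holding the umbrella -> man#1"
```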
Related papers
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z) - Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general-purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z) - Visual Semantic Parsing: From Images to Abstract Meaning Representation [20.60579156219413]
We propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR).
Our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input.
Our findings point to important future research directions for improved scene understanding.
arXiv Detail & Related papers (2022-10-26T17:06:42Z) - Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing incremental structure expanding (ISE).
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z) - Consensus Graph Representation Learning for Better Grounded Image
Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z) - Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z) - GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z) - Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representations.
arXiv Detail & Related papers (2020-07-19T01:45:02Z) - Cops-Ref: A new Dataset and Task on Compositional Referring Expression
Comprehension [39.40351938417889]
Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression.
Some popular referring expression datasets fail to provide an ideal test bed for evaluating the reasoning ability of the models.
We propose a new dataset for visual reasoning in context of referring expression comprehension with two main features.
arXiv Detail & Related papers (2020-03-01T04:59:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.