Language and Visual Entity Relationship Graph for Agent Navigation
- URL: http://arxiv.org/abs/2010.09304v2
- Date: Fri, 25 Dec 2020 02:43:43 GMT
- Title: Language and Visual Entity Relationship Graph for Agent Navigation
- Authors: Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen
Gould
- Abstract summary: Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art.
- Score: 54.059606864535304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate in a
real-world environment following natural language instructions. From both the
textual and visual perspectives, we find that the relationships among the
scene, its objects, and directional clues are essential for the agent to
interpret complex instructions and correctly perceive the environment. To
capture and utilize the relationships, we propose a novel Language and Visual
Entity Relationship Graph for modelling the inter-modal relationships between
text and vision, and the intra-modal relationships among visual entities. We
propose a message passing algorithm for propagating information between
language elements and visual entities in the graph, which we then combine to
determine the next action to take. Experiments show that by taking advantage of
the relationships we are able to improve over state-of-the-art. On the
Room-to-Room (R2R) benchmark, our method achieves the new best performance on
the test unseen split with success rate weighted by path length (SPL) of 52%.
On the Room-for-Room (R4R) dataset, our method significantly improves the
previous best from 13% to 34% on the success weighted by normalized dynamic
time warping (SDTW). Code is available at:
https://github.com/YicongHong/Entity-Graph-VLN.
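To make the abstract's core idea concrete, here is a minimal, hedged sketch of cross-modal message passing of the kind described above: three coarse visual entity nodes (scene, object, direction) each attend over the instruction words, then exchange messages with one another, and the fused node states score a candidate action. The node definitions, shapes, and layer choices are illustrative assumptions only; the authors' actual model is in the linked repository.

```python
# Illustrative sketch (not the authors' implementation) of message passing
# between language features and visual entity nodes in a cross-modal graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityGraphStep(nn.Module):
    def __init__(self, dim: int, n_rounds: int = 2):
        super().__init__()
        self.n_rounds = n_rounds
        # Inter-modal edges: each visual entity node attends over the words.
        self.lang_query = nn.ModuleDict({
            k: nn.Linear(dim, dim) for k in ("scene", "object", "direction")
        })
        # Intra-modal edges: message passing among the three visual entity nodes.
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)
        # Action logit for one candidate direction from the fused node states.
        self.action_head = nn.Linear(3 * dim, 1)

    def forward(self, words, nodes):
        # words: (L, dim) instruction token features
        # nodes: dict of (dim,) vectors for "scene", "object", "direction"
        keys = ("scene", "object", "direction")
        for _ in range(self.n_rounds):
            # 1) Each node gathers its own language context via soft attention.
            ctx = {}
            for k in keys:
                q = self.lang_query[k](nodes[k])            # (dim,)
                attn = F.softmax(words @ q, dim=0)          # (L,)
                ctx[k] = attn @ words                       # (dim,)
            # 2) Each node receives messages from the other visual entities.
            new_nodes = {}
            for k in keys:
                others = torch.stack([nodes[o] for o in keys if o != k])
                message = torch.tanh(self.msg(others)).mean(dim=0) + ctx[k]
                new_nodes[k] = self.update(message.unsqueeze(0),
                                           nodes[k].unsqueeze(0)).squeeze(0)
            nodes = new_nodes
        fused = torch.cat([nodes[k] for k in keys], dim=-1)  # (3*dim,)
        return self.action_head(fused), nodes

# Toy usage: score one candidate viewpoint against a 10-word instruction.
dim = 32
step = EntityGraphStep(dim)
words = torch.randn(10, dim)
nodes = {k: torch.randn(dim) for k in ("scene", "object", "direction")}
logit, nodes = step(words, nodes)
```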
Related papers
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with a newly designed memory graph attention layer.
Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation [11.76365012394685]
Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by segmenting objects and predicting relations among objects.
Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names.
We propose to use language information to assist relation prediction, particularly for rare relations.
arXiv Detail & Related papers (2023-11-27T17:05:25Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T17:51:04Z)
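The gated graph convolution mentioned in the image-captioning entry above can be illustrated with a short, hedged sketch: region features are aggregated over a semantic graph of object pairs, with a learned per-edge gate deciding how much each neighbour contributes. Names, shapes, and the gating form are illustrative assumptions, not that paper's implementation.

```python
# Sketch of a gated GCN layer over a semantic graph of detected regions.
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_msg = nn.Linear(dim, dim)       # transform neighbour features
        self.w_gate = nn.Linear(2 * dim, dim)  # per-edge gate from the node pair
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, dim) region features; adj: (N, N) 0/1 semantic-graph edges
        n = h.size(0)
        src = h.unsqueeze(1).expand(n, n, -1)  # feature of node i for pair (i, j)
        dst = h.unsqueeze(0).expand(n, n, -1)  # feature of node j for pair (i, j)
        gate = torch.sigmoid(self.w_gate(torch.cat([src, dst], dim=-1)))
        msg = gate * self.w_msg(dst)           # gated message j -> i
        msg = (msg * adj.unsqueeze(-1)).sum(dim=1) / adj.sum(1, keepdim=True).clamp(min=1)
        return self.act(h + msg)               # residual update of each region feature

# Toy usage: 5 detected regions with a sparse semantic graph.
h = torch.randn(5, 64)
adj = (torch.rand(5, 5) > 0.6).float()
out = GatedGCNLayer(64)(h, adj)                # (5, 64) relation-enriched features
```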
- Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information [83.62098382773266]
Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions.
We propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes.
Our agent achieves the new state of the art on the Room-Across-Room dataset, which contains instructions in 3 languages.
arXiv Detail & Related papers (2021-04-19T19:18:41Z)
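A hedged sketch of the dependency-syntax idea in the entry above: each instruction token is tagged with its dependency relation (here via spaCy, purely for illustration), and that relation is embedded and fused with the word embedding before the instruction reaches the encoder. The parser, feature set, and fusion layer are assumptions rather than that paper's exact design.

```python
# Deriving per-token syntactic features from a dependency parse and fusing
# them with word embeddings for a syntax-aware instruction encoder.
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy parser model is installed

class SyntaxAwareEmbedding(nn.Module):
    def __init__(self, dep_labels, word_dim: int = 64, dep_dim: int = 16):
        super().__init__()
        self.dep_to_id = {d: i for i, d in enumerate(dep_labels)}
        self.dep_emb = nn.Embedding(len(dep_labels), dep_dim)
        self.proj = nn.Linear(word_dim + dep_dim, word_dim)

    def forward(self, word_emb: torch.Tensor, dep_tags: list) -> torch.Tensor:
        # word_emb: (L, word_dim) embeddings from any tokenizer/encoder
        # dep_tags: list of L dependency relation labels (e.g. "prep", "dobj")
        ids = torch.tensor([self.dep_to_id[d] for d in dep_tags])
        return self.proj(torch.cat([word_emb, self.dep_emb(ids)], dim=-1))

instruction = "Walk past the sofa and stop next to the lamp"
doc = nlp(instruction)
dep_tags = [tok.dep_ for tok in doc]   # relation of each token to its head
heads = [tok.head.i for tok in doc]    # tree structure, unused in this toy example
embed = SyntaxAwareEmbedding(sorted(set(dep_tags)))
word_emb = torch.randn(len(doc), 64)   # stand-in for real word embeddings
fused = embed(word_emb, dep_tags)      # (L, 64) syntax-aware token features
```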
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
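The teacher-recommended learning idea in the entry above is, at heart, a distillation-style objective; the sketch below combines cross-entropy on ground-truth tokens with a KL term toward an external language model's soft word distribution. The loss weighting, temperature, and teacher are illustrative assumptions, not that paper's exact recipe.

```python
# Distillation-style objective: fit ground-truth tokens while also matching
# the soft word distribution suggested by an external language model.
import torch
import torch.nn.functional as F

def trl_loss(student_logits, gt_tokens, teacher_logits, alpha=0.5, tau=2.0):
    # student_logits: (T, V) per-step vocabulary logits from the caption model
    # gt_tokens:      (T,)   ground-truth token ids
    # teacher_logits: (T, V) logits from the external language model at the same steps
    ce = F.cross_entropy(student_logits, gt_tokens)
    soft_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_student = F.log_softmax(student_logits / tau, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * tau * tau
    return (1 - alpha) * ce + alpha * kd

# Toy usage with a 100-word vocabulary and a 7-token caption.
T, V = 7, 100
student = torch.randn(T, V, requires_grad=True)  # stands in for caption-model output
loss = trl_loss(student, torch.randint(0, V, (T,)), torch.randn(T, V))
loss.backward()  # gradients would flow into the caption model in real training
```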