Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation
Using Scene Object Spectrum Grounding
- URL: http://arxiv.org/abs/2303.04077v1
- Date: Tue, 7 Mar 2023 17:39:53 GMT
- Title: Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation
Using Scene Object Spectrum Grounding
- Authors: Minyoung Hwang, Jaeyeon Jeong, Minsoo Kim, Yoonseon Oh, Songhwai Oh
- Abstract summary: We propose a hierarchical navigation method deploying an exploitation policy to correct misled recent actions.
We show that an exploitation policy, which moves the agent toward a well-chosen local goal, outperforms a method which moves the agent to a previously visited state.
We present a novel visual representation, called scene object spectrum (SOS), which performs a category-wise 2D Fourier transform of detected objects.
- Score: 16.784045122994506
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The main challenge in vision-and-language navigation (VLN) is how to
understand natural-language instructions in an unseen environment. A key
limitation of conventional VLN algorithms is that once an action is mistaken, the
agent fails to follow the instructions or explores unnecessary regions, leading
it down an irrecoverable path. To tackle this problem, we propose
Meta-Explore, a hierarchical navigation method deploying an exploitation policy
to correct misled recent actions. We show that an exploitation policy, which
moves the agent toward a well-chosen local goal among unvisited but observable
states, outperforms a method which moves the agent to a previously visited
state. We also highlight the need to identify regretful explorations using
semantically meaningful cues. The key to our approach is understanding the
placement of objects around the agent in the spectral domain. Specifically, we
present a novel visual representation, called scene object spectrum (SOS),
which performs a category-wise 2D Fourier transform of detected objects.
Combining the exploitation policy with SOS features, the agent can correct its
path by choosing a promising local goal. We evaluate our method on three VLN
benchmarks: R2R, SOON, and REVERIE. Meta-Explore outperforms other baselines
and shows strong generalization performance. In addition, local goal search
using the proposed spectral-domain SOS features significantly improves the
success rate by 17.1% and SPL by 20.6% on the SOON benchmark.
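The abstract names two concrete mechanisms: the SOS feature (a category-wise 2D Fourier transform over detected objects) and an exploitation policy that scores unvisited but observable states as candidate local goals. Below is a minimal sketch of that pipeline; the grid size, the box rasterization, the function names, and the spectral-energy scoring rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def scene_object_spectrum(detections, num_categories, grid=64):
    """Category-wise 2D Fourier transform of detected objects (an SOS-style feature).

    detections: iterable of (category_id, x0, y0, x1, y1) boxes, assumed to be
    rasterized onto a grid x grid panorama (an illustrative assumption).
    Returns an array of shape (num_categories, grid, grid) of magnitude spectra.
    """
    masks = np.zeros((num_categories, grid, grid), dtype=np.float32)
    for cat, x0, y0, x1, y1 in detections:
        masks[cat, y0:y1, x0:x1] = 1.0  # binary occupancy mask per category
    # The FFT magnitude is invariant to circular translation: it captures which
    # categories appear and their spatial layout, not their exact positions.
    return np.abs(np.fft.fft2(masks, axes=(-2, -1)))

def score_local_goal(sos, target_categories):
    """Hypothetical scoring rule: total spectral energy of the object
    categories mentioned in the instruction."""
    return sum(np.linalg.norm(sos[c]) for c in target_categories)

# Hypothetical usage: choose among unvisited but observable candidate states.
candidates = {
    "node_a": [(3, 5, 10, 15, 30), (7, 28, 0, 36, 20)],  # e.g., chair + door
    "node_b": [(3, 40, 12, 50, 32)],                     # chair only
}
target = [3, 7]  # category ids referenced by the instruction
best = max(
    candidates,
    key=lambda n: score_local_goal(
        scene_object_spectrum(candidates[n], num_categories=10), target
    ),
)
print(best)  # node_a: it contains both instruction-relevant categories
```

The magnitude spectrum's translation invariance is one reason a spectral representation can summarize object layout robustly. The paper derives its actual local-goal score from SOS features in its own way; this sketch only conveys the general shape of the approach.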
Related papers
- Improving Zero-Shot ObjectNav with Generative Communication [60.84730028539513]
We propose a new method for improving zero-shot ObjectNav.
Our approach takes into account that the ground agent may have a limited and sometimes obstructed view.
arXiv Detail & Related papers (2024-08-03T22:55:26Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms are prone to making inaccurate decisions due to their lack of visual common sense and limited reasoning capabilities.
We propose a Hierarchical Spatial Proximity Reasoning (HSPR) method to help the agent build a knowledge base of hierarchical spatial proximity.
We validate our approach with experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R.
arXiv Detail & Related papers (2024-03-18T07:51:22Z)
- Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
arXiv Detail & Related papers (2023-08-07T01:43:25Z)
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
- CLIP the Gap: A Single Domain Generalization Approach for Object Detection [60.20931827772482]
Single Domain Generalization tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain.
We propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts.
We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss.
arXiv Detail & Related papers (2023-01-13T12:01:18Z)
- SGoLAM: Simultaneous Goal Localization and Mapping for Multi-Object Goal Navigation [5.447924312563365]
We present SGoLAM, a simple and efficient algorithm for Multi-Object Goal navigation.
Given an agent equipped with an RGB-D camera and a GPS/compass sensor, our objective is to have the agent navigate to a sequence of target objects in realistic 3D environments.
SGoLAM is ranked 2nd in the CVPR 2021 MultiON (Multi-Object Goal Navigation) challenge.
arXiv Detail & Related papers (2021-10-14T06:15:14Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describe the route step by step.
This approach deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in the AI2-THOR environment, show that our model outperforms the baselines on both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
- Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation [44.019674347733506]
We investigate the popular Room-to-Room (R2R) VLN benchmark and discover that what is important is not only the amount of data you synthesize, but also how you do it.
We find that shortest path sampling, which is used by both the R2R benchmark and existing augmentation methods, encodes biases in the agent's action space, which we dub action priors.
We then show that these action priors offer one explanation toward the poor generalization of existing works.
arXiv Detail & Related papers (2020-03-31T14:52:42Z)