Language-Based Augmentation to Address Shortcut Learning in Object Goal
Navigation
- URL: http://arxiv.org/abs/2402.05090v1
- Date: Wed, 7 Feb 2024 18:44:27 GMT
- Title: Language-Based Augmentation to Address Shortcut Learning in Object Goal
Navigation
- Authors: Dennis Hoftijzer and Gertjan Burghouts and Luuk Spreeuwers
- Abstract summary: We aim to deepen our understanding of shortcut learning in ObjectNav.
We insert a shortcut bias by associating room types with specific wall colors, and observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to environments where this association does not hold.
We find that shortcut learning is the root cause: the agent learns to navigate to target objects simply by searching for the wall color associated with the target object's room.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Reinforcement Learning (DRL) has shown great potential in enabling
robots to find certain objects (e.g., 'find a fridge') in environments like
homes or schools. This task is known as Object-Goal Navigation (ObjectNav). DRL
methods are predominantly trained and evaluated using environment simulators.
Although DRL has shown impressive results, the simulators may be biased or
limited. This creates a risk of shortcut learning, i.e., learning a policy
tailored to specific visual details of training environments. We aim to deepen
our understanding of shortcut learning in ObjectNav, its implications and
propose a solution. We design an experiment for inserting a shortcut bias in
the appearance of training environments. As a proof-of-concept, we associate
room types with specific wall colors (e.g., bedrooms with green walls), and
observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to
environments where this is not the case (e.g., bedrooms with blue walls). We
find that shortcut learning is the root cause: the agent learns to navigate to
target objects simply by searching for the wall color associated with the
target object's room. To solve this, we propose Language-Based (L-B) augmentation. Our
key insight is that we can leverage the multimodal feature space of a
Vision-Language Model (VLM) to augment visual representations directly at the
feature level, requiring no changes to the simulator and only the addition of
one layer to the model. Where the SOTA ObjectNav method's success rate drops
by 69%, ours drops by only 23%.
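The key insight lends itself to a short illustration. The sketch below is not the authors' implementation: it assumes a CLIP-style VLM (here OpenAI's `clip` package with an RN50 backbone) and shows feature-level augmentation by mixing an observation's image embedding with text embeddings that describe alternative wall colors, so the downstream navigation policy sees appearance variation without any change to the simulator or re-rendering. The prompt list, the helper `augment_features`, and the mixing weight `alpha` are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of Language-Based (L-B) feature augmentation in a VLM's
# joint image-text embedding space. Not the paper's exact model: the paper
# adds a single layer to the ObjectNav agent; here we only illustrate the
# idea of perturbing visual features with language-described appearance.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Hypothetical appearance variations expressed purely in language.
wall_color_prompts = [
    "a bedroom with green walls",
    "a bedroom with blue walls",
    "a bedroom with red walls",
]
text_tokens = clip.tokenize(wall_color_prompts).to(device)


@torch.no_grad()
def augment_features(pil_image, alpha: float = 0.25) -> torch.Tensor:
    """Shift the image embedding toward a randomly chosen text embedding,
    simulating a wall-color change at the feature level (no re-rendering)."""
    img = preprocess(pil_image).unsqueeze(0).to(device)
    img_feat = model.encode_image(img).float()
    txt_feat = model.encode_text(text_tokens).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    idx = torch.randint(len(wall_color_prompts), (1,)).item()  # one variation per call
    aug = (1 - alpha) * img_feat + alpha * txt_feat[idx]
    return aug / aug.norm(dim=-1, keepdim=True)  # fed to the navigation policy
```

In a training loop, the augmented feature would stand in for the agent's usual visual embedding at that step, making a fixed wall-color cue unreliable; at evaluation the augmentation would simply be skipped.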
Related papers
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects [84.73092715537364]
In this paper, we study a new task of navigating to diverse target objects in a large number of scene types.
We build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning.
Our agent achieves a success rate that surpasses GPT-4o by over 20%.
arXiv Detail & Related papers (2024-10-03T17:49:28Z)
- OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models [16.50443396055173]
We propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object navigation.
We first unleash the reasoning abilities of large language models to extract proposed objects from natural language instructions.
We then leverage the generalizability of large vision language models to actively discover and detect candidate objects from the scene.
arXiv Detail & Related papers (2024-02-16T13:21:33Z)
- Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration
for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings [43.65945397307492]
We present a scalable approach for learning open-world object-goal navigation (ObjectNav).
Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind.
arXiv Detail & Related papers (2022-06-24T17:59:02Z)
- Zero-shot object goal visual navigation [15.149900666249096]
In real households, there may exist numerous object classes that the robot needs to deal with.
We propose a zero-shot object navigation task by combining zero-shot learning with object goal visual navigation.
Our model outperforms the baseline models in both seen and unseen classes.
arXiv Detail & Related papers (2022-06-15T09:53:43Z)
- Auxiliary Tasks and Exploration Enable ObjectNav [48.314102158070874]
We re-enable a generic learned agent by adding auxiliary learning tasks and an exploration reward.
Our agents achieve 24.5% success and 8.1% SPL, a 37% and 8% relative improvement over prior state-of-the-art, respectively.
arXiv Detail & Related papers (2021-04-08T23:03:21Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial
Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)