Grounding Linguistic Commands to Navigable Regions
- URL: http://arxiv.org/abs/2112.13031v1
- Date: Fri, 24 Dec 2021 11:11:44 GMT
- Title: Grounding Linguistic Commands to Navigable Regions
- Authors: Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi, K Madhava Krishna
- Abstract summary: We propose the novel task of Referring Navigable Regions (RNR) for autonomous vehicles.
RNR focuses on grounding regions of interest for navigation based on the linguistic command.
We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car dataset.
- Score: 20.368898881882547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans have a natural ability to effortlessly comprehend linguistic commands
such as "park next to the yellow sedan" and instinctively know which region of
the road the vehicle should navigate. Extending this ability to autonomous
vehicles is the next step towards creating fully autonomous agents that respond
and act according to human commands. To this end, we propose the novel task of
Referring Navigable Regions (RNR), i.e., grounding regions of interest for
navigation based on the linguistic command. RNR is different from Referring
Image Segmentation (RIS), which focuses on grounding an object referred to by
the natural language expression instead of grounding a navigable region. For
example, given the command "park next to the yellow sedan," RIS aims to segment
the referred sedan, whereas RNR aims to segment the suggested parking region on the
road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing
Talk2car dataset with segmentation masks for the regions described by the
linguistic commands. A separate test split with concise manoeuvre-oriented
commands is provided to assess the practicality of our dataset. We benchmark
the proposed dataset using a novel transformer-based architecture. We present
extensive ablations and show superior performance over baselines on multiple
evaluation metrics. A downstream path planner generating trajectories based on
RNR outputs confirms the efficacy of the proposed framework.
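To make the task setup concrete, below is a minimal Python sketch of an RNR-style sample (camera image, linguistic command, navigable-region mask) together with a region-IoU metric. The field names and the IoU evaluation are illustrative assumptions, not the released Talk2Car-RegSeg format or the paper's exact metric suite.

```python
import numpy as np

# Minimal sketch of the RNR setup described in the abstract above. The sample
# layout and the region-IoU metric are illustrative assumptions, not the
# released Talk2Car-RegSeg loader or the paper's exact evaluation suite.

def region_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a predicted navigable-region mask and the ground truth
    (both boolean H x W arrays)."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# Hypothetical RNR sample: front-camera image, linguistic command, and the
# binary mask of the road region the command refers to.
sample = {
    "image": np.zeros((256, 512, 3), dtype=np.uint8),    # placeholder frame
    "command": "park next to the yellow sedan",
    "region_mask": np.zeros((256, 512), dtype=bool),      # ground-truth navigable region
}

# An RNR model maps (image, command) -> predicted mask; a dummy prediction
# stands in here just to show how the metric would be applied.
predicted_mask = np.zeros((256, 512), dtype=bool)
print("region IoU:", region_iou(predicted_mask, sample["region_mask"]))
```

A downstream planner of the kind mentioned in the abstract would then consume the predicted mask, for instance by generating a trajectory toward a point selected within it.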
Related papers
- doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation [0.0]
doScenes is a novel dataset designed to facilitate research on human-vehicle instruction interactions.
DoScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning.
arXiv Detail & Related papers (2024-12-08T11:16:47Z)
- PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on a constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
arXiv Detail & Related papers (2024-07-16T08:22:18Z)
- Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs [29.507826791509384]
This paper explores leveraging large language models for map-free off-road navigation using generative AI.
We propose a method where a robot receives verbal instructions, converted to text through Whisper, and a large language model extracts landmarks, preferred terrains, and crucial adverbs, which are translated into speed settings for constrained navigation (an illustrative sketch of this pipeline appears after this list).
arXiv Detail & Related papers (2024-04-02T20:46:13Z)
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
- Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers a deeper insight into the LiDAR-based grounding task and we expect it presents a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z)
- SLAN: Self-Locator Aided Network for Cross-Modal Understanding [89.20623874655352]
We propose Self-Locator Aided Network (SLAN) for cross-modal understanding tasks.
SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts.
It achieves fairly competitive results on five cross-modal understanding tasks.
arXiv Detail & Related papers (2022-11-28T11:42:23Z)
- Ground then Navigate: Language-guided Navigation in Dynamic Scenes [13.870303451896248]
We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings.
We solve the problem by explicitly grounding the navigable regions corresponding to the textual command.
We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.
arXiv Detail & Related papers (2022-09-24T09:51:09Z)
- Grounding Commands for Autonomous Vehicles via Layer Fusion with Region-specific Dynamic Layer Attention [24.18160842892381]
We study the problem of language grounding for autonomous vehicles, which aims to localize a region in a visual scene according to a natural language command from a passenger.
Our approach helps predict more accurate regions and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T02:37:11Z)
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, i.e., the natural language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, with a competitive MRR accuracy of 18.69%.
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
- GANav: Group-wise Attention Network for Classifying Navigable Regions in Unstructured Outdoor Environments [54.21959527308051]
We present a new learning-based method for identifying safe and navigable regions in off-road terrains and unstructured environments from RGB images.
Our approach consists of classifying groups of terrain classes based on their navigability levels using coarse-grained semantic segmentation.
We show through extensive evaluations on the RUGD and RELLIS-3D datasets that our learning algorithm improves the accuracy of visual perception in off-road terrains for navigation.
arXiv Detail & Related papers (2021-03-07T02:16:24Z)
- Commands 4 Autonomous Vehicles (C4AV) Workshop Summary [91.92872482200018]
This paper presents the results of the Commands for Autonomous Vehicles (C4AV) challenge based on the recent Talk2Car dataset.
We identify the aspects that render top-performing models successful, and relate them to existing state-of-the-art models for visual grounding.
arXiv Detail & Related papers (2020-09-18T12:33:21Z)
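As referenced in the "Constrained Robotic Navigation on Preferred Terrains" entry above, the sketch below shows one way a verbal instruction could be transcribed and decomposed into landmarks, preferred terrains, and adverb-derived speed limits. The `transcribe_speech` and `query_llm` stubs, the prompt wording, and the adverb-to-speed table are assumptions made for illustration, not the authors' implementation.

```python
import json

# Illustrative sketch of a speech-to-constrained-navigation pipeline:
# speech -> text -> LLM extraction of landmarks, terrains, and adverbs,
# with adverbs mapped to a speed limit. All stubs and mappings below are
# assumptions, not the method from the cited paper.

ADVERB_SPEEDS = {"slowly": 0.3, "carefully": 0.5, "quickly": 1.0}  # m/s, assumed

def transcribe_speech(audio_path: str) -> str:
    # Placeholder for a speech-to-text call (e.g. a Whisper transcription).
    return "go slowly to the big oak tree and stay on the gravel path"

def query_llm(prompt: str) -> str:
    # Placeholder for a large-language-model call that returns JSON.
    return json.dumps({"landmarks": ["big oak tree"],
                       "preferred_terrains": ["gravel path"],
                       "adverbs": ["slowly"]})

def parse_instruction(audio_path: str) -> dict:
    text = transcribe_speech(audio_path)
    prompt = ("Extract landmarks, preferred terrains, and adverbs from this "
              f"navigation instruction as JSON: {text!r}")
    fields = json.loads(query_llm(prompt))
    # Translate recognized adverbs into a single speed constraint for the planner.
    speeds = [ADVERB_SPEEDS[a] for a in fields["adverbs"] if a in ADVERB_SPEEDS]
    fields["speed_limit"] = min(speeds) if speeds else 1.0
    return fields

print(parse_instruction("instruction.wav"))
```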