Grounding Commands for Autonomous Vehicles via Layer Fusion with
Region-specific Dynamic Layer Attention
- URL: http://arxiv.org/abs/2203.06822v1
- Date: Mon, 14 Mar 2022 02:37:11 GMT
- Title: Grounding Commands for Autonomous Vehicles via Layer Fusion with
Region-specific Dynamic Layer Attention
- Authors: Hou Pong Chan, Mingxi Guo, Cheng-Zhong Xu
- Abstract summary: We study the problem of language grounding for autonomous vehicles, which aims to localize a region in a visual scene according to a natural language command from a passenger.
Our approach helps predict more accurate regions and outperforms state-of-the-art methods.
- Score: 24.18160842892381
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Grounding a command to the visual environment is an essential ingredient for
interactions between autonomous vehicles and humans. In this work, we study the
problem of language grounding for autonomous vehicles, which aims to localize a
region in a visual scene according to a natural language command from a
passenger. Prior work only employs the top layer representations of a
vision-and-language pre-trained model to predict the region referred to by the
command. However, such a method omits the useful features encoded in other
layers, and thus results in inadequate understanding of the input scene and
command. To tackle this limitation, we present the first layer fusion approach
for this task. Since different visual regions may require distinct types of
features to disambiguate them from each other, we further propose the
region-specific dynamic (RSD) layer attention to adaptively fuse the multimodal
information across layers for each region. Extensive experiments on the
Talk2Car benchmark demonstrate that our approach helps predict more accurate
regions and outperforms state-of-the-art methods.
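The abstract describes fusing encoder layers with attention weights computed separately for each candidate region, but does not give the exact parameterization. Below is a minimal PyTorch sketch of that idea, assuming access to the per-region hidden states of every layer of the vision-and-language encoder (module names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class RegionSpecificLayerFusion(nn.Module):
    """Fuses features from all encoder layers with attention weights
    computed separately for each candidate region (illustrative sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Scores each (layer, region) feature; the softmax over layers
        # therefore differs from region to region.
        self.layer_scorer = nn.Linear(hidden_dim, 1)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, num_layers, num_regions, hidden_dim),
        # i.e. the hidden state of every region at every encoder layer.
        scores = self.layer_scorer(layer_states).squeeze(-1)       # (B, L, R)
        weights = torch.softmax(scores, dim=1)                     # attend over layers, per region
        fused = (weights.unsqueeze(-1) * layer_states).sum(dim=1)  # (B, R, H)
        return fused

# Toy usage: 12 encoder layers, 36 candidate regions, 768-d features.
states = torch.randn(2, 12, 36, 768)
region_features = RegionSpecificLayerFusion(hidden_dim=768)(states)
print(region_features.shape)  # torch.Size([2, 36, 768])
```

Because the softmax runs over the layer axis independently for each region, two regions in the same scene can draw on different layers, which matches the motivation stated in the abstract.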
Related papers
- Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction [4.692621855184482]
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain so that it generalizes to unseen target domains.
Recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains.
We propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks.
arXiv Detail & Related papers (2025-04-27T02:55:54Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal-modality-agnostic active geo-localization agent that generalizes zero-shot across different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution [54.05367433562495]
Region-level multi-modality methods can translate referred image regions into human-preferred language descriptions.
Unfortunately, most existing methods rely on fixed visual inputs and lack the resolution adaptability needed to produce precise language descriptions.
We propose a dynamic resolution approach, referred to as DynRefer, to pursue high-accuracy region-level referring.
arXiv Detail & Related papers (2024-05-25T05:44:55Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
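The summary does not state how the per-location distributions are maintained; one plausible reading, sketched below with a hypothetical grid size, label set, and log-evidence update (all assumptions, not the paper's method), accumulates projected observations into a per-cell distribution over region labels:

```python
import numpy as np

# Hypothetical map: 20 m x 20 m area at 0.5 m resolution, 5 candidate region labels.
GRID, LABELS, RES = 40, 5, 0.5
log_map = np.zeros((GRID, GRID, LABELS))  # unnormalized log-evidence per cell

def update(log_map, world_xy, label_probs):
    """Accumulate one egocentric observation already projected to world coordinates."""
    i, j = int(world_xy[0] / RES), int(world_xy[1] / RES)
    if 0 <= i < GRID and 0 <= j < GRID:
        log_map[i, j] += np.log(label_probs + 1e-8)  # Bayesian-style evidence accumulation
    return log_map

def label_distribution(log_map):
    """Normalize accumulated evidence into a per-cell distribution over labels."""
    e = np.exp(log_map - log_map.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One observation: the vision-to-language model says this spot is probably label 0.
log_map = update(log_map, world_xy=(3.2, 7.7),
                 label_probs=np.array([0.7, 0.1, 0.1, 0.05, 0.05]))
probs = label_distribution(log_map)
print(probs[6, 15])  # distribution over the 5 region labels at that cell
```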
- Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly trains a LiDAR-based object detector with language features and predicts the targeted region directly from the detector.
Our work offers deeper insight into the LiDAR-based grounding task, which we expect to be a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z)
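The summary says the grounded region is predicted directly from the detector once language features are learned jointly with it; a minimal sketch of one such language-conditioned scoring head over detector proposals (layer sizes and names are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class LanguageGuidedDetectionHead(nn.Module):
    """Scores LiDAR detector proposals against an encoded command so the
    grounded region comes straight from the detection head (sketch)."""

    def __init__(self, proposal_dim=256, text_dim=256, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(proposal_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, proposal_feats, text_feat):
        # proposal_feats: (B, P, proposal_dim) RoI features from a LiDAR detector
        # text_feat:      (B, text_dim) pooled embedding of the command
        text = text_feat.unsqueeze(1).expand(-1, proposal_feats.size(1), -1)
        scores = self.fuse(torch.cat([proposal_feats, text], dim=-1)).squeeze(-1)
        return scores  # (B, P); argmax gives the grounded proposal

head = LanguageGuidedDetectionHead()
scores = head(torch.randn(2, 50, 256), torch.randn(2, 256))
print(scores.argmax(dim=1))  # index of the proposal predicted to match the command
```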
- SLAN: Self-Locator Aided Network for Cross-Modal Understanding [89.20623874655352]
We propose Self-Locator Aided Network (SLAN) for cross-modal understanding tasks.
SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts.
It achieves fairly competitive results on five cross-modal understanding tasks.
arXiv Detail & Related papers (2022-11-28T11:42:23Z)
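Only the two components are named here; a rough sketch of how a text-conditioned region filter and region adaptor could be wired, assuming the filter keeps the top-k text-similar regions and the adaptor applies a residual, text-conditioned refinement (both are assumptions, not the released SLAN code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionFilter(nn.Module):
    """Keeps the k region features most similar to the text embedding."""
    def __init__(self, k=8):
        super().__init__()
        self.k = k

    def forward(self, regions, text):
        # regions: (B, R, D), text: (B, D)
        sims = F.cosine_similarity(regions, text.unsqueeze(1), dim=-1)  # (B, R)
        idx = sims.topk(self.k, dim=1).indices                          # (B, k)
        return torch.gather(regions, 1, idx.unsqueeze(-1).expand(-1, -1, regions.size(-1)))

class RegionAdaptor(nn.Module):
    """Refines the kept regions conditioned on the text embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, regions, text):
        text = text.unsqueeze(1).expand_as(regions)
        return regions + self.mlp(torch.cat([regions, text], dim=-1))   # residual refinement

regions, text = torch.randn(2, 36, 256), torch.randn(2, 256)
kept = RegionFilter(k=8)(regions, text)       # (2, 8, 256) text-relevant regions
refined = RegionAdaptor(dim=256)(kept, text)  # (2, 8, 256) adapted features
print(refined.shape)
```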
- Point-Level Region Contrast for Object Detection Pre-Training [147.47349344401806]
We present point-level region contrast, a self-supervised pre-training approach for the task of object detection.
Our approach performs contrastive learning by directly sampling individual point pairs from different regions.
Compared to an aggregated representation per region, our approach is more robust to changes in input region quality.
arXiv Detail & Related papers (2022-02-09T18:56:41Z)
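The summary names the mechanism but not the loss; a minimal sketch under a standard InfoNCE-style formulation, where points sampled from the same region act as positives and all other points as negatives (function name and shapes are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def point_level_region_contrast(features, region_ids, temperature=0.1):
    """Contrastive loss over individual points: points from the same region
    are positives, points from other regions are negatives.

    features:   (N, D) point-level embeddings sampled from a feature map
    region_ids: (N,)   id of the region each point was sampled from
    """
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                        # (N, N) point-pair similarities
    same = region_ids.unsqueeze(0) == region_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                    # positive pairs, excluding self
    logits = sim.masked_fill(eye, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[pos]).mean()                       # assumes at least one positive pair

# Toy usage: 16 points sampled from 4 regions, 128-d features.
feats = torch.randn(16, 128, requires_grad=True)
ids = torch.randint(0, 4, (16,))
print(point_level_region_contrast(feats, ids))
```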
- Grounding Linguistic Commands to Navigable Regions [20.368898881882547]
We propose the novel task of Referring Navigable Regions (RNR) for autonomous vehicles.
RNR focuses on grounding regions of interest for navigation based on the linguistic command.
We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset.
arXiv Detail & Related papers (2021-12-24T11:11:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.