Grounding Commands for Autonomous Vehicles via Layer Fusion with
Region-specific Dynamic Layer Attention
- URL: http://arxiv.org/abs/2203.06822v1
- Date: Mon, 14 Mar 2022 02:37:11 GMT
- Title: Grounding Commands for Autonomous Vehicles via Layer Fusion with
Region-specific Dynamic Layer Attention
- Authors: Hou Pong Chan, Mingxi Guo, Cheng-Zhong Xu
- Abstract summary: We study the problem of language grounding for autonomous vehicles, which aims to localize a region in a visual scene according to a natural language command from a passenger.
Our approach helps predict more accurate regions and outperforms state-of-the-art methods.
- Score: 24.18160842892381
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Grounding a command to the visual environment is an essential ingredient for
interactions between autonomous vehicles and humans. In this work, we study the
problem of language grounding for autonomous vehicles, which aims to localize a
region in a visual scene according to a natural language command from a
passenger. Prior work only employs the top layer representations of a
vision-and-language pre-trained model to predict the region referred to by the
command. However, such a method omits the useful features encoded in other
layers, and thus results in inadequate understanding of the input scene and
command. To tackle this limitation, we present the first layer fusion approach
for this task. Since different visual regions may require distinct types of
features to disambiguate them from each other, we further propose the
region-specific dynamic (RSD) layer attention to adaptively fuse the multimodal
information across layers for each region. Extensive experiments on the
Talk2Car benchmark demonstrate that our approach helps predict more accurate
regions and outperforms state-of-the-art methods.
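The abstract describes fusing encoder layers with attention weights computed separately for each candidate region, but does not give the exact parameterization. Below is a minimal PyTorch sketch of that idea, assuming access to the per-region hidden states of every layer of the vision-and-language encoder (module names and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class RegionSpecificLayerFusion(nn.Module):
    """Fuses features from all encoder layers with attention weights
    computed separately for each candidate region (illustrative sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Scores each (layer, region) feature; the softmax over layers
        # therefore differs from region to region.
        self.layer_scorer = nn.Linear(hidden_dim, 1)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch, num_layers, num_regions, hidden_dim),
        # i.e. the hidden state of every region at every encoder layer.
        scores = self.layer_scorer(layer_states).squeeze(-1)       # (B, L, R)
        weights = torch.softmax(scores, dim=1)                     # attend over layers, per region
        fused = (weights.unsqueeze(-1) * layer_states).sum(dim=1)  # (B, R, H)
        return fused

# Toy usage: 12 encoder layers, 36 candidate regions, 768-d features.
states = torch.randn(2, 12, 36, 768)
region_features = RegionSpecificLayerFusion(hidden_dim=768)(states)
print(region_features.shape)  # torch.Size([2, 36, 768])
```

Because the softmax runs over the layer axis independently for each region, two regions in the same scene can draw on different layers, which matches the motivation stated in the abstract.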
Related papers
- Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction [4.692621855184482]
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain so that it generalizes to unseen target domains.
Recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains.
We propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks.
arXiv Detail & Related papers (2025-04-27T02:55:54Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal-modality-agnostic active geo-localization agent that generalizes zero-shot across different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution [54.05367433562495]
Region-level multi-modality methods can translate referred image regions into human-preferred language descriptions.
Unfortunately, most existing methods rely on fixed visual inputs and lack the resolution adaptability needed to produce precise language descriptions.
We propose a dynamic resolution approach, referred to as DynRefer, to pursue high-accuracy region-level referring.
arXiv Detail & Related papers (2024-05-25T05:44:55Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
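The summary does not state how the per-location distributions are maintained; one plausible reading, sketched below with a hypothetical grid size, label set, and log-evidence update (all assumptions, not the paper's method), accumulates projected observations into a per-cell distribution over region labels:

```python
import numpy as np

# Hypothetical map: 20 m x 20 m area at 0.5 m resolution, 5 candidate region labels.
GRID, LABELS, RES = 40, 5, 0.5
log_map = np.zeros((GRID, GRID, LABELS))  # unnormalized log-evidence per cell

def update(log_map, world_xy, label_probs):
    """Accumulate one egocentric observation already projected to world coordinates."""
    i, j = int(world_xy[0] / RES), int(world_xy[1] / RES)
    if 0 <= i < GRID and 0 <= j < GRID:
        log_map[i, j] += np.log(label_probs + 1e-8)  # Bayesian-style evidence accumulation
    return log_map

def label_distribution(log_map):
    """Normalize accumulated evidence into a per-cell distribution over labels."""
    e = np.exp(log_map - log_map.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One observation: the vision-to-language model says this spot is probably label 0.
log_map = update(log_map, world_xy=(3.2, 7.7),
                 label_probs=np.array([0.7, 0.1, 0.1, 0.05, 0.05]))
probs = label_distribution(log_map)
print(probs[6, 15])  # distribution over the 5 region labels at that cell
```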
- Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly trains a LiDAR-based object detector with language features and predicts the targeted region directly from the detector.
Our work offers deeper insight into the LiDAR-based grounding task, which we expect to be a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z)
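The summary says the grounded region is predicted directly from the detector once language features are learned jointly with it; a minimal sketch of one such language-conditioned scoring head over detector proposals (layer sizes and names are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class LanguageGuidedDetectionHead(nn.Module):
    """Scores LiDAR detector proposals against an encoded command so the
    grounded region comes straight from the detection head (sketch)."""

    def __init__(self, proposal_dim=256, text_dim=256, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(proposal_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, proposal_feats, text_feat):
        # proposal_feats: (B, P, proposal_dim) RoI features from a LiDAR detector
        # text_feat:      (B, text_dim) pooled embedding of the command
        text = text_feat.unsqueeze(1).expand(-1, proposal_feats.size(1), -1)
        scores = self.fuse(torch.cat([proposal_feats, text], dim=-1)).squeeze(-1)
        return scores  # (B, P); argmax gives the grounded proposal

head = LanguageGuidedDetectionHead()
scores = head(torch.randn(2, 50, 256), torch.randn(2, 256))
print(scores.argmax(dim=1))  # index of the proposal predicted to match the command
```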
- SLAN: Self-Locator Aided Network for Cross-Modal Understanding [89.20623874655352]
We propose Self-Locator Aided Network (SLAN) for cross-modal understanding tasks.
SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts.
It achieves fairly competitive results on five cross-modal understanding tasks.
arXiv Detail & Related papers (2022-11-28T11:42:23Z)
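Only the two components are named here; a rough sketch of how a text-conditioned region filter and region adaptor could be wired, assuming the filter keeps the top-k text-similar regions and the adaptor applies a residual, text-conditioned refinement (both are assumptions, not the released SLAN code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionFilter(nn.Module):
    """Keeps the k region features most similar to the text embedding."""
    def __init__(self, k=8):
        super().__init__()
        self.k = k

    def forward(self, regions, text):
        # regions: (B, R, D), text: (B, D)
        sims = F.cosine_similarity(regions, text.unsqueeze(1), dim=-1)  # (B, R)
        idx = sims.topk(self.k, dim=1).indices                          # (B, k)
        return torch.gather(regions, 1, idx.unsqueeze(-1).expand(-1, -1, regions.size(-1)))

class RegionAdaptor(nn.Module):
    """Refines the kept regions conditioned on the text embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, regions, text):
        text = text.unsqueeze(1).expand_as(regions)
        return regions + self.mlp(torch.cat([regions, text], dim=-1))   # residual refinement

regions, text = torch.randn(2, 36, 256), torch.randn(2, 256)
kept = RegionFilter(k=8)(regions, text)       # (2, 8, 256) text-relevant regions
refined = RegionAdaptor(dim=256)(kept, text)  # (2, 8, 256) adapted features
print(refined.shape)
```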
- Point-Level Region Contrast for Object Detection Pre-Training [147.47349344401806]
We present point-level region contrast, a self-supervised pre-training approach for the task of object detection.
Our approach performs contrastive learning by directly sampling individual point pairs from different regions.
Compared to an aggregated representation per region, our approach is more robust to changes in input region quality.
arXiv Detail & Related papers (2022-02-09T18:56:41Z)
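The summary names the mechanism but not the loss; a minimal sketch under a standard InfoNCE-style formulation, where points sampled from the same region act as positives and all other points as negatives (function name and shapes are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def point_level_region_contrast(features, region_ids, temperature=0.1):
    """Contrastive loss over individual points: points from the same region
    are positives, points from other regions are negatives.

    features:   (N, D) point-level embeddings sampled from a feature map
    region_ids: (N,)   id of the region each point was sampled from
    """
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                        # (N, N) point-pair similarities
    same = region_ids.unsqueeze(0) == region_ids.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                    # positive pairs, excluding self
    logits = sim.masked_fill(eye, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[pos]).mean()                       # assumes at least one positive pair

# Toy usage: 16 points sampled from 4 regions, 128-d features.
feats = torch.randn(16, 128, requires_grad=True)
ids = torch.randint(0, 4, (16,))
print(point_level_region_contrast(feats, ids))
```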
- Grounding Linguistic Commands to Navigable Regions [20.368898881882547]
We propose the novel task of Referring Navigable Regions (RNR) for autonomous vehicles.
RNR focuses on grounding regions of interest for navigation based on the linguistic command.
We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset.
arXiv Detail & Related papers (2021-12-24T11:11:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.