Vision Transformer for Learning Driving Policies in Complex Multi-Agent
Environments
- URL: http://arxiv.org/abs/2109.06514v1
- Date: Tue, 14 Sep 2021 08:18:47 GMT
- Title: Vision Transformer for Learning Driving Policies in Complex Multi-Agent
Environments
- Authors: Eshagh Kargar, Ville Kyrki
- Abstract summary: We propose to use Vision Transformer (ViT) to learn a driving policy in urban settings with bird's-eye-view (BEV) input images.
The ViT network learns the global context of the scene more effectively than previously proposed Convolutional Neural Networks (ConvNets).
- Score: 17.825845543579195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driving in a complex urban environment is a difficult task that requires a
complex decision policy. In order to make informed decisions, one needs to gain
an understanding of the long-range context and the importance of other
vehicles. In this work, we propose to use Vision Transformer (ViT) to learn a
driving policy in urban settings with bird's-eye-view (BEV) input images. The
ViT network learns the global context of the scene more effectively than
previously proposed Convolutional Neural Networks (ConvNets). Furthermore, ViT's
attention mechanism helps to learn an attention map for the scene which allows
the ego car to determine which surrounding cars are important to its next
decision. We demonstrate that a DQN agent with a ViT backbone outperforms
baseline algorithms with ConvNet backbones pre-trained in various ways. In
particular, the proposed method helps reinforcement learning algorithms to
learn faster, with increased performance and less data than baselines.
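The pipeline the abstract describes (BEV image, patch tokens, self-attention, Q-values for a DQN agent) can be illustrated with a minimal sketch. This is not the authors' implementation: the 4x4 occupancy grid, the patch size, the single attention head, the mean pooling, the random weights, and all function names are assumptions chosen for illustration, and position embeddings, MLP blocks, layer normalization, and training are omitted.

```python
import math
import random

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tokens.append([image[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    qs = [matvec(wq, t) for t in tokens]
    ks = [matvec(wk, t) for t in tokens]
    vs = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(qs[0]))
    # weights[i][j]: how strongly patch i attends to patch j
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
               for q in qs]
    outputs = [[sum(w * v[d] for w, v in zip(row, vs))
                for d in range(len(vs[0]))] for row in weights]
    return outputs, weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def q_values(bev, params, patch=2):
    """Map a BEV grid to one Q-value per action via attention and a linear head."""
    tokens = patchify(bev, patch)
    outputs, weights = self_attention(tokens, params["wq"], params["wk"], params["wv"])
    pooled = [sum(col) / len(outputs) for col in zip(*outputs)]  # mean over patches
    return matvec(params["head"], pooled), weights

def select_action(q, epsilon, rng):
    """Epsilon-greedy action selection, as commonly used with DQN."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

rng = random.Random(0)
dim, n_actions, patch = 4, 3, 2  # illustrative sizes, not from the paper
params = {
    "wq": random_matrix(dim, patch * patch, rng),
    "wk": random_matrix(dim, patch * patch, rng),
    "wv": random_matrix(dim, patch * patch, rng),
    "head": random_matrix(n_actions, dim, rng),
}
# Toy 4x4 occupancy grid: 1 marks a cell occupied by another vehicle.
bev = [[0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1]]
q, attn = q_values(bev, params)
action = select_action(q, epsilon=0.0, rng=rng)  # epsilon=0 means greedy
```

Each row of `attn` is a probability distribution over patches, which is the kind of attention map the abstract refers to when it says the ego car can determine which surrounding cars matter for its next decision.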
Related papers
- Hierarchical and Decoupled BEV Perception Learning Framework for Autonomous Driving [52.808273563372126]
This paper proposes a novel hierarchical Bird's-eye-view (BEV) perception paradigm.
It aims to provide a library of fundamental perception modules and user-friendly graphical interface.
We conduct the Pretrain-Finetune strategy to effectively utilize large scale public datasets and streamline development processes.
arXiv Detail & Related papers (2024-07-17T11:17:20Z)
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Vehicle Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems.
We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- Deep Perspective Transformation Based Vehicle Localization on Bird's Eye View [0.49747156441456597]
Traditional approaches rely on installing multiple sensors to simulate the environment.
We propose an alternative solution by generating a top-down representation of the scene.
We present an architecture that transforms perspective view RGB images into bird's-eye-view maps with segmented surrounding vehicles.
arXiv Detail & Related papers (2023-11-12T10:16:42Z)
- Federated Deep Learning Meets Autonomous Vehicle Perception: Design and Verification [168.67190934250868]
Federated learning empowered connected autonomous vehicle (FLCAV) has been proposed.
FLCAV preserves privacy while reducing communication and annotation costs.
It is challenging to determine the network resources and road sensor poses for multi-stage training.
arXiv Detail & Related papers (2022-06-03T23:55:45Z)
- Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs are limited in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- Graph Neural Network Reinforcement Learning for Autonomous Mobility-on-Demand Systems [42.08603087208381]
We argue that the AMoD control problem is naturally cast as a node-wise decision-making problem.
We propose a deep reinforcement learning framework to control the rebalancing of AMoD systems through graph neural networks.
We show how the learned policies exhibit promising zero-shot transfer capabilities when faced with critical portability tasks.
arXiv Detail & Related papers (2021-04-23T06:42:38Z)
- Increasing the Efficiency of Policy Learning for Autonomous Vehicles by Multi-Task Representation Learning [17.825845543579195]
We propose to learn a low-dimensional and rich latent representation of the environment by leveraging the knowledge of relevant semantic factors.
We also propose a hazard signal in addition to the learned latent representation as input to a down-stream policy.
In particular, the proposed representation learning and the hazard signal help reinforcement learning to learn faster, with increased performance and less data than baseline methods.
arXiv Detail & Related papers (2021-03-26T20:16:59Z)
- Autonomous Navigation through intersections with Graph Convolutional Networks and Conditional Imitation Learning for Self-driving Cars [10.080958939027363]
In autonomous driving, navigation through unsignaled intersections is a challenging task.
We propose a novel branched network G-CIL for the navigation policy learning.
Our end-to-end trainable neural network outperforms the baselines with higher success rate and shorter navigation time.
arXiv Detail & Related papers (2021-02-01T07:33:12Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)