Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments
- URL: http://arxiv.org/abs/2109.06514v1
- Date: Tue, 14 Sep 2021 08:18:47 GMT
- Title: Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments
- Authors: Eshagh Kargar, Ville Kyrki
- Abstract summary: We propose to use a Vision Transformer (ViT) to learn a driving policy in urban settings from bird's-eye-view (BEV) input images.
The ViT network learns the global context of the scene more effectively than previously proposed Convolutional Neural Networks (ConvNets).
- Score: 17.825845543579195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driving in a complex urban environment is a difficult task that requires a complex decision policy. To make informed decisions, an agent needs to understand the long-range context of the scene and the relative importance of other vehicles. In this work, we propose to use a Vision Transformer (ViT) to learn a driving policy in urban settings from bird's-eye-view (BEV) input images. The ViT network learns the global context of the scene more effectively than previously proposed Convolutional Neural Networks (ConvNets). Furthermore, ViT's attention mechanism yields an attention map over the scene that allows the ego car to determine which surrounding cars are important to its next decision. We demonstrate that a DQN agent with a ViT backbone outperforms baseline algorithms with ConvNet backbones pre-trained in various ways. In particular, the proposed method helps reinforcement learning algorithms learn faster, reaching higher performance with less data than the baselines.
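The combination described above, a value-based RL agent whose image encoder is a transformer, can be made concrete with a short sketch. Below is a minimal PyTorch Q-network with a ViT-style backbone over BEV images; all hyperparameters (64x64 input, 8x8 patches, depth 4, a 5-action discrete space) are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of a DQN Q-network with a ViT-style backbone over BEV images.
# Hyperparameters are illustrative, not the paper's reported configuration.
import torch
import torch.nn as nn

class ViTQNetwork(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, n_actions=5):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch embedding: split the BEV image into patches, project each to `dim`.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.q_head = nn.Linear(dim, n_actions)   # one Q-value per discrete action

    def forward(self, bev):                                  # bev: (B, 3, 64, 64)
        x = self.patchify(bev).flatten(2).transpose(1, 2)    # (B, N, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)                    # self-attention over all patches
        return self.q_head(x[:, 0])            # read Q-values from the [CLS] slot

q_net = ViTQNetwork()
q_values = q_net(torch.randn(2, 3, 64, 64))    # greedy action: q_values.argmax(-1)
```

Reading the Q-values from the [CLS] token mirrors standard ViT classification heads; the encoder's attention weights over patch tokens are the kind of scene attention map the abstract describes.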
Related papers
- SPformer: A Transformer Based DRL Decision Making Method for Connected Automated Vehicles [9.840325772591024]
We propose a CAV decision-making architecture based on transformer and reinforcement learning algorithms.
A learnable policy token is used as the learning medium of the multi-vehicle joint policy.
Our model can make good use of all the state information of vehicles in the traffic scenario.
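A minimal sketch of the policy-token idea as summarized above: a trainable vector is prepended to the per-vehicle state tokens, and the action head reads that slot after self-attention has mixed in information from all vehicles. Dimensions, depth, and the discrete action head are illustrative assumptions, not SPformer's actual configuration.

```python
# Minimal sketch of a learnable policy token prepended to per-vehicle tokens.
import torch
import torch.nn as nn

class PolicyTokenNet(nn.Module):
    def __init__(self, state_dim=6, dim=64, n_actions=5):
        super().__init__()
        self.embed = nn.Linear(state_dim, dim)          # per-vehicle state -> token
        self.policy_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, vehicle_states):                  # (B, n_vehicles, state_dim)
        tokens = self.embed(vehicle_states)
        tokens = torch.cat([self.policy_token.expand(tokens.size(0), -1, -1), tokens], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, 0])                     # read the policy-token slot

logits = PolicyTokenNet()(torch.randn(2, 8, 6))         # 8 surrounding vehicles
```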
arXiv Detail & Related papers (2024-09-23T15:16:35Z)
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
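As a rough illustration only (the paper's STAE module is more elaborate), an AV-centric attention step can be sketched as the ego embedding querying the surrounding-vehicle embeddings, with the attended summary feeding the SAC policy input; the attention weights double as per-vehicle importance scores. Dimensions and the single attention layer are assumptions.

```python
# Rough sketch of ego-centric attention over surrounding vehicles.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
ego = torch.randn(2, 1, 32)                   # (B, 1, dim) ego-vehicle embedding
others = torch.randn(2, 6, 32)                # (B, n_vehicles, dim) surrounding vehicles
summary, weights = attn(ego, others, others)  # weights: importance of each vehicle
policy_input = torch.cat([ego, summary], dim=-1).squeeze(1)   # (B, 2*dim) to SAC
```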
arXiv Detail & Related papers (2024-07-12T02:34:44Z)
- DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems.
We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- Deep Perspective Transformation Based Vehicle Localization on Bird's Eye View [0.49747156441456597]
Traditional approaches rely on installing multiple sensors to simulate the environment.
We propose an alternative solution by generating a top-down representation of the scene.
We present an architecture that transforms perspective view RGB images into bird's-eye-view maps with segmented surrounding vehicles.
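In the fixed-camera, flat-ground special case, a perspective-to-BEV transformation reduces to a planar homography. The OpenCV sketch below uses made-up calibration points; the learned architecture in the paper is more general than this fixed warp.

```python
# Minimal perspective-to-BEV warp via a planar homography (OpenCV).
# The four ground-plane correspondences are made-up calibration values.
import cv2
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in camera frame
src = np.float32([[220, 300], [420, 300], [640, 480], [0, 480]])  # road region (image px)
dst = np.float32([[0, 0], [200, 0], [200, 400], [0, 400]])        # BEV canvas (px)
H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography
bev = cv2.warpPerspective(frame, H, (200, 400))    # top-down view of the ground plane
```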
arXiv Detail & Related papers (2023-11-12T10:16:42Z)
- Federated Deep Learning Meets Autonomous Vehicle Perception: Design and Verification [168.67190934250868]
Federated learning empowered connected autonomous vehicle (FLCAV) has been proposed.
FLCAV preserves privacy while reducing communication and annotation costs.
It is challenging to determine the network resources and road sensor poses for multi-stage training.
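For background, the core of most federated learning schemes is server-side weight averaging (FedAvg). The sketch below shows only that aggregation step, not FLCAV's multi-stage pipeline or its network-resource and sensor-pose optimization.

```python
# Minimal FedAvg-style aggregation: each vehicle trains locally, the server
# averages the resulting weights. `weights` are client mixing coefficients
# (e.g., proportional to local data size) and should sum to 1.
import torch

def fed_avg(state_dicts, weights):
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights))
    return avg

clients = [torch.nn.Linear(4, 2).state_dict() for _ in range(3)]  # stand-in clients
global_sd = fed_avg(clients, [0.5, 0.3, 0.2])
```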
arXiv Detail & Related papers (2022-06-03T23:55:45Z)
- Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies.
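The contrast can be shown in a few lines: in scaled dot-product self-attention every visual token scores every other token in one step, whereas a convolution mixes only a local neighborhood. A minimal sketch with illustrative shapes:

```python
# Minimal scaled dot-product self-attention over a grid of visual tokens.
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # all pairs of tokens
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 49, 32)                  # 7x7 grid of visual tokens
w = [torch.randn(32, 32) for _ in range(3)]
out = self_attention(x, *w)                 # (2, 49, 32)
```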
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, the natural-language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train state-of-the-art vision models with a transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, yielding a competitive 18.69% MRR accuracy.
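The MRR figure quoted above is straightforward to compute: rank all candidate vehicles for each language query and average the reciprocal rank of the true match. A minimal sketch with random stand-in scores:

```python
# Minimal Mean Reciprocal Rank (MRR): 1/rank of the true candidate, averaged
# over queries. Scores here are random stand-ins, not model outputs.
import numpy as np

def mean_reciprocal_rank(scores, true_idx):
    # scores: (n_queries, n_candidates); true_idx: ground-truth column per query
    true_scores = scores[np.arange(len(true_idx)), true_idx][:, None]
    ranks = (scores >= true_scores).sum(axis=1)   # rank of the true candidate
    return float(np.mean(1.0 / ranks))

print(mean_reciprocal_rank(np.random.rand(5, 20), np.arange(5)))
```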
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
- Increasing the Efficiency of Policy Learning for Autonomous Vehicles by Multi-Task Representation Learning [17.825845543579195]
We propose to learn a low-dimensional and rich latent representation of the environment by leveraging the knowledge of relevant semantic factors.
We also propose a hazard signal, in addition to the learned latent representation, as input to a downstream policy.
In particular, the proposed representation learning and the hazard signal help reinforcement learning to learn faster, with increased performance and less data than baseline methods.
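A minimal sketch of the input side of such a policy, assuming the learned latent and the hazard signal are fused by simple concatenation (the paper's exact fusion may differ):

```python
# Minimal sketch: policy consumes a learned latent plus an extra hazard signal.
import torch
import torch.nn as nn

latent = torch.randn(2, 64)        # encoder output for the scene (stand-in)
hazard = torch.rand(2, 1)          # scalar hazard signal per sample (stand-in)
policy = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 5))
action_logits = policy(torch.cat([latent, hazard], dim=-1))
```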
arXiv Detail & Related papers (2021-03-26T20:16:59Z)
- Autonomous Navigation through Intersections with Graph Convolutional Networks and Conditional Imitation Learning for Self-driving Cars [10.080958939027363]
In autonomous driving, navigation through unsignaled intersections is a challenging task.
We propose a novel branched network, G-CIL, for navigation policy learning.
Our end-to-end trainable neural network outperforms the baselines with a higher success rate and shorter navigation time.
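For background on the conditional-imitation-learning part of this design (the graph-convolutional encoder is omitted): the policy keeps one output head per high-level navigation command and selects the head matching the current command. A minimal sketch with illustrative dimensions:

```python
# Minimal branched (command-conditioned) policy head for conditional imitation
# learning: one head per command (e.g., left / straight / right).
import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    def __init__(self, feat_dim=128, n_branches=3, act_dim=2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
            for _ in range(n_branches))

    def forward(self, features, command):       # command: (B,) int in [0, n_branches)
        out = torch.stack([b(features) for b in self.branches], dim=1)  # (B, nb, act)
        return out[torch.arange(features.size(0)), command]  # pick the commanded head

acts = BranchedPolicy()(torch.randn(4, 128), torch.tensor([0, 2, 1, 0]))
```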
arXiv Detail & Related papers (2021-02-01T07:33:12Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)