Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments
- URL: http://arxiv.org/abs/2109.06514v1
- Date: Tue, 14 Sep 2021 08:18:47 GMT
- Title: Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments
- Authors: Eshagh Kargar, Ville Kyrki
- Abstract summary: We propose to use a Vision Transformer (ViT) to learn a driving policy in urban settings from bird's-eye-view (BEV) input images.
The ViT network learns the global context of the scene more effectively than previously proposed Convolutional Neural Networks (ConvNets).
- Score: 17.825845543579195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driving in a complex urban environment is a difficult task that requires a complex decision policy. To make informed decisions, an agent needs to understand the long-range context of the scene and the relative importance of other vehicles. In this work, we propose to use a Vision Transformer (ViT) to learn a driving policy in urban settings from bird's-eye-view (BEV) input images. The ViT network learns the global context of the scene more effectively than previously proposed Convolutional Neural Networks (ConvNets). Furthermore, ViT's attention mechanism yields an attention map over the scene that allows the ego car to determine which surrounding cars are important to its next decision. We demonstrate that a DQN agent with a ViT backbone outperforms baseline algorithms with ConvNet backbones pre-trained in various ways. In particular, the proposed method helps reinforcement learning algorithms learn faster, reaching higher performance with less data than the baselines.
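The combination described above, a value-based RL agent whose image encoder is a transformer, can be made concrete with a short sketch. Below is a minimal PyTorch Q-network with a ViT-style backbone over BEV images; all hyperparameters (64x64 input, 8x8 patches, depth 4, a 5-action discrete space) are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of a DQN Q-network with a ViT-style backbone over BEV images.
# Hyperparameters are illustrative, not the paper's reported configuration.
import torch
import torch.nn as nn

class ViTQNetwork(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, n_actions=5):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch embedding: split the BEV image into patches, project each to `dim`.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.q_head = nn.Linear(dim, n_actions)   # one Q-value per discrete action

    def forward(self, bev):                                  # bev: (B, 3, 64, 64)
        x = self.patchify(bev).flatten(2).transpose(1, 2)    # (B, N, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)                    # self-attention over all patches
        return self.q_head(x[:, 0])            # read Q-values from the [CLS] slot

q_net = ViTQNetwork()
q_values = q_net(torch.randn(2, 3, 64, 64))    # greedy action: q_values.argmax(-1)
```

Reading the Q-values from the [CLS] token mirrors standard ViT classification heads; the encoder's attention weights over patch tokens are the kind of scene attention map the abstract describes.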
Related papers
- SPformer: A Transformer Based DRL Decision Making Method for Connected Automated Vehicles [9.840325772591024]
We propose a CAV decision-making architecture based on transformer and reinforcement learning algorithms.
A learnable policy token is used as the learning medium of the multi-vehicle joint policy.
Our model can make good use of all the state information of vehicles in the traffic scenario.
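A minimal sketch of the policy-token idea as summarized above: a trainable vector is prepended to the per-vehicle state tokens, and the action head reads that slot after self-attention has mixed in information from all vehicles. Dimensions, depth, and the discrete action head are illustrative assumptions, not SPformer's actual configuration.

```python
# Minimal sketch of a learnable policy token prepended to per-vehicle tokens.
import torch
import torch.nn as nn

class PolicyTokenNet(nn.Module):
    def __init__(self, state_dim=6, dim=64, n_actions=5):
        super().__init__()
        self.embed = nn.Linear(state_dim, dim)          # per-vehicle state -> token
        self.policy_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, vehicle_states):                  # (B, n_vehicles, state_dim)
        tokens = self.embed(vehicle_states)
        tokens = torch.cat([self.policy_token.expand(tokens.size(0), -1, -1), tokens], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, 0])                     # read the policy-token slot

logits = PolicyTokenNet()(torch.randn(2, 8, 6))         # 8 surrounding vehicles
```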
arXiv Detail & Related papers (2024-09-23T15:16:35Z)
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
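As a rough illustration only (the paper's STAE module is more elaborate), an AV-centric attention step can be sketched as the ego embedding querying the surrounding-vehicle embeddings, with the attended summary feeding the SAC policy input; the attention weights double as per-vehicle importance scores. Dimensions and the single attention layer are assumptions.

```python
# Rough sketch of ego-centric attention over surrounding vehicles.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
ego = torch.randn(2, 1, 32)                   # (B, 1, dim) ego-vehicle embedding
others = torch.randn(2, 6, 32)                # (B, n_vehicles, dim) surrounding vehicles
summary, weights = attn(ego, others, others)  # weights: importance of each vehicle
policy_input = torch.cat([ego, summary], dim=-1).squeeze(1)   # (B, 2*dim) to SAC
```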
arXiv Detail & Related papers (2024-07-12T02:34:44Z)
- DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems.
We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- Deep Perspective Transformation Based Vehicle Localization on Bird's Eye View [0.49747156441456597]
Traditional approaches rely on installing multiple sensors to simulate the environment.
We propose an alternative solution by generating a top-down representation of the scene.
We present an architecture that transforms perspective view RGB images into bird's-eye-view maps with segmented surrounding vehicles.
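In the fixed-camera, flat-ground special case, a perspective-to-BEV transformation reduces to a planar homography. The OpenCV sketch below uses made-up calibration points; the learned architecture in the paper is more general than this fixed warp.

```python
# Minimal perspective-to-BEV warp via a planar homography (OpenCV).
# The four ground-plane correspondences are made-up calibration values.
import cv2
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in camera frame
src = np.float32([[220, 300], [420, 300], [640, 480], [0, 480]])  # road region (image px)
dst = np.float32([[0, 0], [200, 0], [200, 400], [0, 400]])        # BEV canvas (px)
H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography
bev = cv2.warpPerspective(frame, H, (200, 400))    # top-down view of the ground plane
```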
arXiv Detail & Related papers (2023-11-12T10:16:42Z)
- Federated Deep Learning Meets Autonomous Vehicle Perception: Design and Verification [168.67190934250868]
Federated learning empowered connected autonomous vehicle (FLCAV) has been proposed.
FLCAV preserves privacy while reducing communication and annotation costs.
It is challenging to determine the network resources and road sensor poses for multi-stage training.
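For background, the core of most federated learning schemes is server-side weight averaging (FedAvg). The sketch below shows only that aggregation step, not FLCAV's multi-stage pipeline or its network-resource and sensor-pose optimization.

```python
# Minimal FedAvg-style aggregation: each vehicle trains locally, the server
# averages the resulting weights. `weights` are client mixing coefficients
# (e.g., proportional to local data size) and should sum to 1.
import torch

def fed_avg(state_dicts, weights):
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights))
    return avg

clients = [torch.nn.Linear(4, 2).state_dict() for _ in range(3)]  # stand-in clients
global_sd = fed_avg(clients, [0.5, 0.3, 0.2])
```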
arXiv Detail & Related papers (2022-06-03T23:55:45Z)
- Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies.
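The contrast can be shown in a few lines: in scaled dot-product self-attention every visual token scores every other token in one step, whereas a convolution mixes only a local neighborhood. A minimal sketch with illustrative shapes:

```python
# Minimal scaled dot-product self-attention over a grid of visual tokens.
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # all pairs of tokens
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 49, 32)                  # 7x7 grid of visual tokens
w = [torch.randn(32, 32) for _ in range(3)]
out = self_attention(x, *w)                 # (2, 49, 32)
```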
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, the natural-language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train state-of-the-art vision models with a transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, yielding a competitive 18.69% MRR accuracy.
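The MRR figure quoted above is straightforward to compute: rank all candidate vehicles for each language query and average the reciprocal rank of the true match. A minimal sketch with random stand-in scores:

```python
# Minimal Mean Reciprocal Rank (MRR): 1/rank of the true candidate, averaged
# over queries. Scores here are random stand-ins, not model outputs.
import numpy as np

def mean_reciprocal_rank(scores, true_idx):
    # scores: (n_queries, n_candidates); true_idx: ground-truth column per query
    true_scores = scores[np.arange(len(true_idx)), true_idx][:, None]
    ranks = (scores >= true_scores).sum(axis=1)   # rank of the true candidate
    return float(np.mean(1.0 / ranks))

print(mean_reciprocal_rank(np.random.rand(5, 20), np.arange(5)))
```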
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
- Increasing the Efficiency of Policy Learning for Autonomous Vehicles by Multi-Task Representation Learning [17.825845543579195]
We propose to learn a low-dimensional and rich latent representation of the environment by leveraging the knowledge of relevant semantic factors.
We also propose a hazard signal, in addition to the learned latent representation, as input to a downstream policy.
In particular, the proposed representation learning and the hazard signal help reinforcement learning to learn faster, with increased performance and less data than baseline methods.
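A minimal sketch of the input side of such a policy, assuming the learned latent and the hazard signal are fused by simple concatenation (the paper's exact fusion may differ):

```python
# Minimal sketch: policy consumes a learned latent plus an extra hazard signal.
import torch
import torch.nn as nn

latent = torch.randn(2, 64)        # encoder output for the scene (stand-in)
hazard = torch.rand(2, 1)          # scalar hazard signal per sample (stand-in)
policy = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 5))
action_logits = policy(torch.cat([latent, hazard], dim=-1))
```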
arXiv Detail & Related papers (2021-03-26T20:16:59Z)
- Autonomous Navigation through Intersections with Graph Convolutional Networks and Conditional Imitation Learning for Self-driving Cars [10.080958939027363]
In autonomous driving, navigation through unsignaled intersections is a challenging task.
We propose a novel branched network, G-CIL, for navigation policy learning.
Our end-to-end trainable neural network outperforms the baselines with a higher success rate and shorter navigation time.
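For background on the conditional-imitation-learning part of this design (the graph-convolutional encoder is omitted): the policy keeps one output head per high-level navigation command and selects the head matching the current command. A minimal sketch with illustrative dimensions:

```python
# Minimal branched (command-conditioned) policy head for conditional imitation
# learning: one head per command (e.g., left / straight / right).
import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    def __init__(self, feat_dim=128, n_branches=3, act_dim=2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
            for _ in range(n_branches))

    def forward(self, features, command):       # command: (B,) int in [0, n_branches)
        out = torch.stack([b(features) for b in self.branches], dim=1)  # (B, nb, act)
        return out[torch.arange(features.size(0)), command]  # pick the commanded head

acts = BranchedPolicy()(torch.randn(4, 128), torch.tensor([0, 2, 1, 0]))
```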
arXiv Detail & Related papers (2021-02-01T07:33:12Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)