GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging
Cross-Modal Attention with Large Language Models
- URL: http://arxiv.org/abs/2312.03543v1
- Date: Wed, 6 Dec 2023 15:14:30 GMT
- Title: GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging
Cross-Modal Attention with Large Language Models
- Authors: Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li,
Yiming Bie, Chengzhong Xu
- Abstract summary: This paper introduces a sophisticated encoder-decoder framework to address visual grounding in autonomous vehicles (AVs).
Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders, including Text, Image, Context, and Cross-Modal encoders, with a Multimodal decoder.
- Score: 17.488420164181463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of autonomous vehicles (AVs), accurately discerning commander
intent and executing linguistic commands within a visual context presents a
significant challenge. This paper introduces a sophisticated encoder-decoder
framework, developed to address visual grounding in AVs. Our Context-Aware
Visual Grounding (CAVG) model is an advanced system that integrates five core
encoders, including Text, Image, Context, and Cross-Modal encoders, with a Multimodal decoder. This
integration enables the CAVG model to adeptly capture contextual semantics and
to learn human emotional features, augmented by state-of-the-art Large Language
Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the
implementation of multi-head cross-modal attention mechanisms and a
Region-Specific Dynamic (RSD) layer for attention modulation. This
architectural design enables the model to efficiently process and interpret a
range of cross-modal inputs, yielding a comprehensive understanding of the
correlation between verbal commands and corresponding visual scenes. Empirical
evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that
CAVG establishes new standards in prediction accuracy and operational
efficiency. Notably, the model exhibits exceptional performance even with
limited training data, ranging from 50% to 75% of the full dataset. This
feature highlights its effectiveness and potential for deployment in practical
AV applications. Moreover, CAVG has shown remarkable robustness and
adaptability in challenging scenarios, including long-text command
interpretation, low-light conditions, ambiguous command contexts, inclement
weather conditions, and densely populated urban environments. The code for the
proposed model is available on our GitHub.
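The abstract describes the core fusion step (text-command features attending over image-region features via multi-head cross-modal attention, with attention modulation by an RSD layer) but not its implementation. Below is a minimal, illustrative sketch of such a cross-modal attention block in PyTorch; the module structure, the region-gating stand-in for the RSD layer, and all names and dimensions are assumptions made for exposition, not the authors' released code (see their GitHub for that).

```python
# Minimal sketch (assumed, not the authors' implementation) of multi-head
# cross-modal attention between text-command features and image-region
# features, with a hypothetical per-region gate standing in for the paper's
# Region-Specific Dynamic (RSD) attention-modulation layer.
import torch
import torch.nn as nn


class CrossModalAttentionSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Text queries attend over image-region keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical per-region gate: rescales how strongly each image
        # region contributes, loosely mirroring "attention modulation".
        self.region_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, num_tokens, dim)  from a text/LLM encoder
        # region_feats: (batch, num_regions, dim) from an image encoder
        gate = self.region_gate(region_feats)      # (batch, num_regions, 1)
        gated_regions = region_feats * gate        # down-/up-weight regions
        fused, _ = self.attn(query=text_feats,
                             key=gated_regions,
                             value=gated_regions)
        return self.norm(text_feats + fused)       # residual + layer norm


if __name__ == "__main__":
    model = CrossModalAttentionSketch()
    text = torch.randn(2, 16, 256)     # e.g., token embeddings of a command
    regions = torch.randn(2, 64, 256)  # e.g., pooled region proposals
    print(model(text, regions).shape)  # torch.Size([2, 16, 256])
```

In this sketch the gate simply rescales each region feature before attention; the paper's actual RSD layer may modulate attention in a different way.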
Related papers
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
arXiv Detail & Related papers (2024-07-12T02:34:44Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models [41.64717254672843]
Visual grounding occupies a pivotal position in multi-modality vision-language models.
We propose ViLaM, a large multi-modality model, that supports multi-tasks of VG.
ViLaM extends a wide range of instructions, thereby significantly enhancing its generalization and interaction potentials.
arXiv Detail & Related papers (2023-11-21T03:40:09Z)
- Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision [24.90534567531536]
We propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS).
The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets.
arXiv Detail & Related papers (2023-07-23T17:55:24Z)
- Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Context-Aware Timewise VAEs for Real-Time Vehicle Trajectory Prediction [4.640835690336652]
We present ContextVAE, a context-aware approach for multi-modal vehicle trajectory prediction.
Our approach takes into account both the social features exhibited by agents on the scene and the physical environment constraints.
In all tested datasets, ContextVAE models are fast to train and provide high-quality multi-modal predictions in real-time.
arXiv Detail & Related papers (2023-02-21T18:42:24Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed as SUPER, to better capture the instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.