GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging
Cross-Modal Attention with Large Language Models
- URL: http://arxiv.org/abs/2312.03543v1
- Date: Wed, 6 Dec 2023 15:14:30 GMT
- Title: GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging
Cross-Modal Attention with Large Language Models
- Authors: Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li,
Yiming Bie, Chengzhong Xu
- Abstract summary: This paper introduces a sophisticated encoder-decoder framework to address visual grounding in autonomous vehicles (AVs).
Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders, including Text, Image, Context, and Cross-Modal encoders, with a Multimodal decoder.
- Score: 17.488420164181463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of autonomous vehicles (AVs), accurately discerning commander
intent and executing linguistic commands within a visual context presents a
significant challenge. This paper introduces a sophisticated encoder-decoder
framework, developed to address visual grounding in AVs. Our Context-Aware
Visual Grounding (CAVG) model is an advanced system that integrates five core
encoders, including Text, Image, Context, and Cross-Modal encoders, with a
Multimodal decoder. This
integration enables the CAVG model to adeptly capture contextual semantics and
to learn human emotional features, augmented by state-of-the-art Large Language
Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the
implementation of multi-head cross-modal attention mechanisms and a
Region-Specific Dynamic (RSD) layer for attention modulation. This
architectural design enables the model to efficiently process and interpret a
range of cross-modal inputs, yielding a comprehensive understanding of the
correlation between verbal commands and corresponding visual scenes. Empirical
evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that
CAVG establishes new standards in prediction accuracy and operational
efficiency. Notably, the model exhibits exceptional performance even with
limited training data, ranging from 50% to 75% of the full dataset. This
feature highlights its effectiveness and potential for deployment in practical
AV applications. Moreover, CAVG has shown remarkable robustness and
adaptability in challenging scenarios, including long-text command
interpretation, low-light conditions, ambiguous command contexts, inclement
weather conditions, and densely populated urban environments. The code for the
proposed model is available on our GitHub.
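The abstract names two mechanisms, multi-head cross-modal attention and a Region-Specific Dynamic (RSD) layer, without giving pseudocode. The PyTorch sketch below is a minimal illustration of how such a pairing could be wired; every module, shape, and parameter name here (CrossModalBlock, RegionSpecificDynamicLayer, num_regions, and so on) is an assumption made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a minimal reconstruction of the two mechanisms
# the abstract names (multi-head cross-modal attention plus a region-specific
# gate that modulates the attended features). Names are assumptions.
import torch
import torch.nn as nn

class RegionSpecificDynamicLayer(nn.Module):
    """Hypothetical RSD layer: re-weights attended features per image region."""
    def __init__(self, dim: int, num_regions: int):
        super().__init__()
        # One learned gate value per candidate region, conditioned on features.
        self.gate = nn.Linear(dim, num_regions)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, num_regions, dim) -- one fused vector per region.
        weights = torch.softmax(self.gate(fused.mean(dim=1)), dim=-1)
        return fused * weights.unsqueeze(-1)  # modulate each region's features

class CrossModalBlock(nn.Module):
    """Region proposals attend over command tokens, then RSD modulation."""
    def __init__(self, dim: int = 256, heads: int = 8, num_regions: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rsd = RegionSpecificDynamicLayer(dim, num_regions)

    def forward(self, text: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # text: (batch, seq_len, dim); regions: (batch, num_regions, dim)
        attended, _ = self.attn(query=regions, key=text, value=text)
        return self.rsd(attended)  # (batch, num_regions, dim)

# Usage: score 32 candidate regions against an encoded language command.
block = CrossModalBlock()
cmd = torch.randn(2, 16, 256)   # encoded command tokens
regs = torch.randn(2, 32, 256)  # encoded region proposals
out = block(cmd, regs)          # (2, 32, 256)
```

In this sketch the command tokens serve as keys and values while region proposals act as queries, so each candidate region gathers command-relevant evidence before the RSD gate re-weights regions; the actual CAVG wiring may differ.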
Related papers
- Optimizing Vision-Language Interactions Through Decoder-Only Models [4.219163079329444]
MUDAIF is a vision-language model that seamlessly integrates visual and textual inputs.
It achieves enhanced efficiency, flexibility, and cross-modal understanding.
It is trained on a large-scale dataset of 45M image-text pairs.
arXiv Detail & Related papers (2024-12-14T09:04:32Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.
InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection [4.930667479611019]
This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection.
It presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space.
We demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC; a toy IoU computation is sketched after this list.
arXiv Detail & Related papers (2024-12-03T16:53:58Z)
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic Discrete Encoding, for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z)
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
arXiv Detail & Related papers (2024-07-12T02:34:44Z)
- Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed SUPER, to better capture the instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention); the structural difference between these two fusion styles is sketched after this list.
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
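For context on the IoU figures quoted in the SJTU entry above, the snippet below shows how intersection-over-union is computed for axis-aligned boxes; the function and example values are illustrative, not taken from that paper.

```python
# Minimal sketch of intersection-over-union (IoU) for axis-aligned boxes,
# the metric quoted in the SJTU entry above. Values are illustrative.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = ~0.1429
```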
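The METER entry above contrasts merged attention with co-attention as multimodal fusion styles. The sketch below illustrates the structural difference under assumed token shapes; it is a generic illustration, not METER's code.

```python
# Illustrative contrast of the two fusion styles named in the METER entry:
# merged attention (one self-attention over concatenated tokens) versus
# co-attention (each modality cross-attends to the other).
import torch
import torch.nn as nn

dim, heads = 256, 8
text = torch.randn(2, 16, dim)   # text tokens
image = torch.randn(2, 49, dim)  # image patch tokens

# Merged attention: concatenate both modalities, run one self-attention.
merged_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
tokens = torch.cat([text, image], dim=1)            # (2, 65, dim)
merged_out, _ = merged_attn(tokens, tokens, tokens)

# Co-attention: a separate cross-attention in each direction.
t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
text_out, _ = t2i(query=text, key=image, value=image)
image_out, _ = i2t(query=image, key=text, value=text)
```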