GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging
Cross-Modal Attention with Large Language Models
- URL: http://arxiv.org/abs/2312.03543v1
- Date: Wed, 6 Dec 2023 15:14:30 GMT
- Title: GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging
Cross-Modal Attention with Large Language Models
- Authors: Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li,
Yiming Bie, Chengzhong Xu
- Abstract summary: This paper introduces a sophisticated encoder-decoder framework to address visual grounding in autonomous vehicles (AVs).
Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders, including Text, Image, Context, and Cross-Modal encoders, with a Multimodal decoder.
- Score: 17.488420164181463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of autonomous vehicles (AVs), accurately discerning commander
intent and executing linguistic commands within a visual context presents a
significant challenge. This paper introduces a sophisticated encoder-decoder
framework, developed to address visual grounding in AVs. Our Context-Aware
Visual Grounding (CAVG) model is an advanced system that integrates five core
encoders, including Text, Image, Context, and Cross-Modal encoders, with a
Multimodal decoder. This
integration enables the CAVG model to adeptly capture contextual semantics and
to learn human emotional features, augmented by state-of-the-art Large Language
Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the
implementation of multi-head cross-modal attention mechanisms and a
Region-Specific Dynamic (RSD) layer for attention modulation. This
architectural design enables the model to efficiently process and interpret a
range of cross-modal inputs, yielding a comprehensive understanding of the
correlation between verbal commands and corresponding visual scenes. Empirical
evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that
CAVG establishes new standards in prediction accuracy and operational
efficiency. Notably, the model exhibits exceptional performance even with
limited training data, ranging from 50% to 75% of the full dataset. This
feature highlights its effectiveness and potential for deployment in practical
AV applications. Moreover, CAVG has shown remarkable robustness and
adaptability in challenging scenarios, including long-text command
interpretation, low-light conditions, ambiguous command contexts, inclement
weather conditions, and densely populated urban environments. The code for the
proposed model is available on our GitHub.
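The abstract names two mechanisms, multi-head cross-modal attention and a Region-Specific Dynamic (RSD) layer, without giving pseudocode. The PyTorch sketch below is a minimal illustration of how such a pairing could be wired; every module, shape, and parameter name here (CrossModalBlock, RegionSpecificDynamicLayer, num_regions, and so on) is an assumption made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a minimal reconstruction of the two mechanisms
# the abstract names (multi-head cross-modal attention plus a region-specific
# gate that modulates the attended features). Names are assumptions.
import torch
import torch.nn as nn

class RegionSpecificDynamicLayer(nn.Module):
    """Hypothetical RSD layer: re-weights attended features per image region."""
    def __init__(self, dim: int, num_regions: int):
        super().__init__()
        # One learned gate value per candidate region, conditioned on features.
        self.gate = nn.Linear(dim, num_regions)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, num_regions, dim) -- one fused vector per region.
        weights = torch.softmax(self.gate(fused.mean(dim=1)), dim=-1)
        return fused * weights.unsqueeze(-1)  # modulate each region's features

class CrossModalBlock(nn.Module):
    """Region proposals attend over command tokens, then RSD modulation."""
    def __init__(self, dim: int = 256, heads: int = 8, num_regions: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rsd = RegionSpecificDynamicLayer(dim, num_regions)

    def forward(self, text: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # text: (batch, seq_len, dim); regions: (batch, num_regions, dim)
        attended, _ = self.attn(query=regions, key=text, value=text)
        return self.rsd(attended)  # (batch, num_regions, dim)

# Usage: score 32 candidate regions against an encoded language command.
block = CrossModalBlock()
cmd = torch.randn(2, 16, 256)   # encoded command tokens
regs = torch.randn(2, 32, 256)  # encoded region proposals
out = block(cmd, regs)          # (2, 32, 256)
```

In this sketch the command tokens serve as keys and values while region proposals act as queries, so each candidate region gathers command-relevant evidence before the RSD gate re-weights regions; the actual CAVG wiring may differ.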
Related papers
- Optimizing Vision-Language Interactions Through Decoder-Only Models [4.219163079329444]
MUDAIF is a vision-language model that seamlessly integrates visual and textual inputs.
It achieves enhanced efficiency, flexibility, and cross-modal understanding.
It is trained on a large-scale dataset of 45M image-text pairs.
arXiv Detail & Related papers (2024-12-14T09:04:32Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.
InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection [4.930667479611019]
This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection.
It presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space.
We demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC; a toy IoU computation is sketched after this list.
arXiv Detail & Related papers (2024-12-03T16:53:58Z)
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic Discrete Encoding, for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z)
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment [2.3575550107698016]
We introduce an AV-centric spatio-temporal attention encoding (STAE) mechanism for learning dynamic interactions with different surrounding vehicles.
To understand map and route context, we employ a context encoder to extract context maps.
The resulting model is trained using the Soft Actor Critic (SAC) algorithm.
arXiv Detail & Related papers (2024-07-12T02:34:44Z)
- Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers [5.538923337818467]
State-of-the-art systems are unable to achieve improved performance in cross-language settings.
We propose a Multimodal Dual Attention Transformer model to improve cross-language SER.
arXiv Detail & Related papers (2023-06-23T22:38:32Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed SUPER, to better capture the instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention); the structural difference between these two fusion styles is sketched after this list.
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
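For context on the IoU figures quoted in the SJTU entry above, the snippet below shows how intersection-over-union is computed for axis-aligned boxes; the function and example values are illustrative, not taken from that paper.

```python
# Minimal sketch of intersection-over-union (IoU) for axis-aligned boxes,
# the metric quoted in the SJTU entry above. Values are illustrative.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = ~0.1429
```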
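The METER entry above contrasts merged attention with co-attention as multimodal fusion styles. The sketch below illustrates the structural difference under assumed token shapes; it is a generic illustration, not METER's code.

```python
# Illustrative contrast of the two fusion styles named in the METER entry:
# merged attention (one self-attention over concatenated tokens) versus
# co-attention (each modality cross-attends to the other).
import torch
import torch.nn as nn

dim, heads = 256, 8
text = torch.randn(2, 16, dim)   # text tokens
image = torch.randn(2, 49, dim)  # image patch tokens

# Merged attention: concatenate both modalities, run one self-attention.
merged_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
tokens = torch.cat([text, image], dim=1)            # (2, 65, dim)
merged_out, _ = merged_attn(tokens, tokens, tokens)

# Co-attention: a separate cross-attention in each direction.
t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
text_out, _ = t2i(query=text, key=image, value=image)
image_out, _ = i2t(query=image, key=text, value=text)
```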