Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic
- URL: http://arxiv.org/abs/2509.11165v1
- Date: Sun, 14 Sep 2025 08:53:06 GMT
- Title: Traffic-MLLM: A Spatio-Temporal MLLM with Retrieval-Augmented Generation for Causal Inference in Traffic
- Authors: Waikit Xiu, Qiang Lu, Xiying Li, Chen Hu, Shengbo Sun
- Abstract summary: We propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning. We also introduce an innovative knowledge prompting module fusing Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG).
- Score: 8.754321713184483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As intelligent transportation systems advance, traffic video understanding plays an increasingly pivotal role in comprehensive scene perception and causal analysis. Yet, existing approaches face notable challenges in accurately modeling spatiotemporal causality and integrating domain-specific knowledge, limiting their effectiveness in complex scenarios. To address these limitations, we propose Traffic-MLLM, a multimodal large language model tailored for fine-grained traffic analysis. Built on the Qwen2.5-VL backbone, our model leverages high-quality traffic-specific multimodal datasets and uses Low-Rank Adaptation (LoRA) for lightweight fine-tuning, significantly enhancing its capacity to model continuous spatiotemporal features in video sequences. Furthermore, we introduce an innovative knowledge prompting module fusing Chain-of-Thought (CoT) reasoning with Retrieval-Augmented Generation (RAG), enabling precise injection of detailed traffic regulations and domain knowledge into the inference process. This design markedly boosts the model's logical reasoning and knowledge adaptation capabilities. Experimental results on TrafficQA and DriveQA benchmarks show Traffic-MLLM achieves state-of-the-art performance, validating its superior ability to process multimodal traffic data. It also exhibits remarkable zero-shot reasoning and cross-scenario generalization capabilities.
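The abstract describes the knowledge prompting module only at a high level. As a rough illustration of the general idea (retrieve relevant rules, then ask for step-by-step reasoning), here is a minimal Python sketch. It is not the authors' implementation: the `KNOWLEDGE_BASE` snippets, the bag-of-words retriever, the function names, and the prompt wording are all assumptions made for this example, and the video input handled by the actual model is left out entirely.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge snippets; the paper's actual regulation corpus is not given here.
KNOWLEDGE_BASE = [
    "Vehicles must stop at a red light before the stop line.",
    "Drivers must yield to pedestrians at marked crosswalks.",
    "Overtaking is prohibited across a solid center line.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=1):
    """Return the k snippets most similar to the query (toy stand-in for a real retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(question, docs, k=1):
    """Combine retrieved knowledge with a step-by-step (CoT-style) instruction."""
    lines = ["Relevant traffic knowledge:"]
    lines += [f"- {rule}" for rule in retrieve(question, docs, k)]
    lines.append(f"Question: {question}")
    lines.append("Let's reason step by step before giving the final answer.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Did the car stop at the red light?", KNOWLEDGE_BASE))
```

In a real system the word-count retriever would likely be replaced by a dense embedding index, and the assembled prompt would be passed to the multimodal model together with the video frames.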
Related papers
- Wireless Traffic Prediction with Large Language Model [54.07581399989292]
TIDES is a novel framework that captures spatial-temporal correlations for wireless traffic prediction. TIDES achieves efficient adaptation to domain-specific patterns without incurring excessive training overhead. Our results indicate that integrating spatial awareness into LLM-based predictors is the key to unlocking scalable and intelligent network management in future 6G systems.
arXiv Detail & Related papers (2025-12-19T04:47:40Z) - RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System [15.222742182076459]
RoadSceneVQA is a large-scale visual question answering dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning.
arXiv Detail & Related papers (2025-11-23T04:40:50Z) - Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap [51.198001060683296]
Large Language Models (LLMs) offer transformative potential to address transportation challenges. This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation. For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization.
arXiv Detail & Related papers (2025-03-27T11:56:27Z) - TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning [61.33599727106222]
TeLL-Drive is a hybrid framework that integrates a Teacher LLM to guide an attention-based Student DRL policy. A self-attention mechanism then fuses these strategies with the DRL agent's exploration, accelerating policy convergence and boosting robustness.
arXiv Detail & Related papers (2025-02-03T14:22:03Z) - Strada-LLM: Graph LLM for traffic prediction [62.2015839597764]
A considerable challenge in traffic prediction lies in handling the diverse data distributions caused by vastly different traffic conditions. We propose a graph-aware LLM for traffic prediction that considers proximal traffic information. We adopt a lightweight approach for efficient domain adaptation when facing new data distributions in few-shot fashion.
arXiv Detail & Related papers (2024-10-28T09:19:29Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - Towards Explainable Traffic Flow Prediction with Large Language Models [36.86937188565623]
We propose a Traffic flow Prediction model based on Large Language Models (LLMs) to generate explainable traffic predictions.
By transferring multi-modal traffic data into natural language descriptions, xTP-LLM captures complex time-series patterns and external factors from comprehensive traffic data.
Empirically, xTP-LLM shows competitive accuracy compared with deep learning baselines, while providing an intuitive and reliable explanation for predictions.
arXiv Detail & Related papers (2024-04-03T07:14:15Z) - A Holistic Framework Towards Vision-based Traffic Signal Control with
Microscopic Simulation [53.39174966020085]
Traffic signal control (TSC) is crucial for reducing traffic congestion, which in turn leads to smoother traffic flow, reduced idling time, and lower CO2 emissions.
In this study, we explore the computer vision approach for TSC that modulates on-road traffic flows through visual observation.
We introduce a holistic traffic simulation framework called TrafficDojo towards vision-based TSC and its benchmarking.
arXiv Detail & Related papers (2024-03-11T16:42:29Z) - TPLLM: A Traffic Prediction Framework Based on Pretrained Large Language Models [27.306180426294784]
We introduce TPLLM, a novel traffic prediction framework leveraging Large Language Models (LLMs).
In this framework, we construct a sequence embedding layer based on Convolutional Neural Networks (CNNs) and a graph embedding layer based on Graph Convolutional Networks (GCNs) to extract sequence features and spatial features.
Experiments on two real-world datasets demonstrate commendable performance in both full-sample and few-shot prediction scenarios.
arXiv Detail & Related papers (2024-03-04T17:08:57Z) - Language-Guided Traffic Simulation via Scene-Level Diffusion [46.47977644226296]
We present CTG++, a scene-level conditional diffusion model that can be guided by language instructions.
We first propose a scene-level diffusion model equipped with a spatio-temporal backbone, which generates realistic and controllable traffic.
We then harness a large language model (LLM) to convert a user's query into a loss function guiding the diffusion model towards query-compliant generation.
arXiv Detail & Related papers (2023-06-10T05:20:30Z) - Guided Conditional Diffusion for Controllable Traffic Simulation [42.198185904248994]
Controllable and realistic traffic simulation is critical for developing and verifying autonomous vehicles.
Data-driven approaches generate realistic and human-like behaviors, improving transfer from simulated to real-world traffic.
We develop a conditional diffusion model for controllable traffic generation (CTG) that allows users to control desired properties of trajectories at test time.
arXiv Detail & Related papers (2022-10-31T14:44:59Z) - Multi-intersection Traffic Optimisation: A Benchmark Dataset and a Strong Baseline [85.9210953301628]
Control of traffic signals is fundamental and critical to alleviate traffic congestion in urban areas.
Because of the high complexity of modelling the problem, experimental settings of current works are often inconsistent.
We propose a novel and strong baseline model based on deep reinforcement learning with the encoder-decoder structure.
arXiv Detail & Related papers (2021-01-24T03:55:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.