TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting
- URL: http://arxiv.org/abs/2508.14782v1
- Date: Wed, 20 Aug 2025 15:27:49 GMT
- Title: TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting
- Authors: Jiaming Leng, Yunying Bi, Chuan Qin, Bing Yin, Yanyong Zhang, Chao Wang
- Abstract summary: Small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios. We propose TransLLM, a unified framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks.
- Score: 26.764515296168145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.
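The abstract outlines a three-stage pipeline: a lightweight spatiotemporal encoder (dilated temporal convolutions plus dual-adjacency graph attention), an RL-trained instance-level prompt router that composes personalized prompts, and task-specific projection heads over an LLM backbone. The sketch below is a minimal, hypothetical PyTorch rendering of that flow; every module name, tensor shape, and the softmax-mixture routing simplification are illustrative assumptions, not the authors' implementation (see the linked repository for the released code).

```python
# Hypothetical sketch of the TransLLM pipeline described in the abstract.
# All names, dimensions, and simplifications are illustrative assumptions.
import torch
import torch.nn as nn


class SpatioTemporalEncoder(nn.Module):
    """Dilated temporal convolutions followed by attention over two adjacency views."""
    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        # Dilated 1-D convolutions capture multi-scale temporal dependencies.
        self.tconv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        # One attention module per adjacency view (e.g., distance-based and flow-based).
        self.gat_a = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.gat_b = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.mix = nn.Linear(2 * hidden, hidden)

    def forward(self, x, mask_a=None, mask_b=None):
        # x: (batch, nodes, time, features)
        b, n, t, f = x.shape
        h = self.tconv(x.reshape(b * n, t, f).transpose(1, 2))  # temporal encoding
        h = h.mean(dim=-1).reshape(b, n, -1)                    # pool over time
        ha, _ = self.gat_a(h, h, h, attn_mask=mask_a)           # spatial view A
        hb, _ = self.gat_b(h, h, h, attn_mask=mask_b)           # spatial view B
        return self.mix(torch.cat([ha, hb], dim=-1))            # (batch, nodes, hidden)


class PromptRouter(nn.Module):
    """Instance-level routing: a soft mixture over learnable prompt segments."""
    def __init__(self, hidden: int, num_prompts: int, prompt_len: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, hidden))
        self.policy = nn.Linear(hidden, num_prompts)  # trained with RL in the paper

    def forward(self, context):
        summary = context.mean(dim=1)                            # pooled instance summary
        weights = torch.softmax(self.policy(summary), dim=-1)    # routing distribution
        return torch.einsum("bk,kld->bld", weights, self.prompts)  # (batch, prompt_len, hidden)


class TransLLMSketch(nn.Module):
    """Compose prompt + spatiotemporal tokens, run a backbone, project to the task."""
    def __init__(self, in_dim=2, hidden=64, out_dim=12):
        super().__init__()
        self.encoder = SpatioTemporalEncoder(in_dim, hidden)
        self.router = PromptRouter(hidden, num_prompts=4, prompt_len=8)
        # Stand-in for the LLM backbone; the paper interfaces with a real LLM.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(hidden, out_dim)  # task-specific output projection

    def forward(self, x, mask_a=None, mask_b=None):
        ctx = self.encoder(x, mask_a, mask_b)
        prompt = self.router(ctx)
        tokens = torch.cat([prompt, ctx], dim=1)  # personalized prompt + context tokens
        h = self.backbone(tokens)
        return self.head(h[:, prompt.size(1):])   # per-node predictions


if __name__ == "__main__":
    model = TransLLMSketch()
    x = torch.randn(2, 32, 12, 2)   # (batch, nodes, time steps, features)
    print(model(x).shape)           # torch.Size([2, 32, 12])
```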
Related papers
- MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving [12.330414519761524]
Trajectory planning is a core task in autonomous driving. MLLMs with Reinforcement Learning have shown promise in addressing "long-tail" scenarios. We present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback.
arXiv Detail & Related papers (2026-01-30T12:47:55Z) - AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z) - Vision-LLMs for Spatiotemporal Traffic Forecasting [14.700408329373998]
Large Language Models (LLMs) inherently struggle to model the complex spatial dependencies of grid-based traffic data. We propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. We show that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-13T11:15:56Z) - VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning [10.497961559068493]
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes. Existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage. VisualTrans is the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios.
arXiv Detail & Related papers (2025-08-06T03:07:05Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling. Our work demonstrates that a task-specific vision-text model can build a generalizable model for spatiotemporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - Diffusion Augmented Retrieval: A Training-Free Approach to Interactive Text-to-Image Retrieval [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a framework that generates multiple intermediate representations via dialogue refinements and DMs. DAR performs on par with finetuned I-TIR models, yet without incurring their tuning overhead.
arXiv Detail & Related papers (2025-01-26T03:29:18Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model [27.56988000960972]
We introduce a new framework based on a dual context of domain-shared and class-specific contexts.
Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in Large Language Models.
We also formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens.
arXiv Detail & Related papers (2024-07-05T13:15:29Z) - Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach for generating visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.