Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
- URL: http://arxiv.org/abs/2601.10551v1
- Date: Thu, 15 Jan 2026 16:16:34 GMT
- Title: Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
- Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
- Abstract summary: General-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. We propose a domain-adapted framework that transforms Large Vision Language Models into specialized agents for intelligent infrastructure analysis. Our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%.
- Score: 12.667510244197047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
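The dual-modality retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding vectors, the two toy stores, the placeholder rule and exemplar names, and the functions `cosine`, `top_k`, and `build_prompt` are all hypothetical, and plain cosine similarity stands in for whatever retriever the paper actually uses. The sketch only shows the general idea of retrieving from a text store and a visual-exemplar store and passing both to the model as context.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, store, k):
    """Return the payloads of the k store entries most similar to the query vector."""
    ranked = sorted(store, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]

def build_prompt(query_vec, text_store, image_store, k=1):
    """Assemble a prompt from retrieved rules (text) and exemplars (visual)."""
    rules = top_k(query_vec, text_store, k)
    exemplars = top_k(query_vec, image_store, k)
    lines = ["Relevant standards:"] + [f"- {r}" for r in rules]
    lines += ["Reference exemplars:"] + [f"- {e}" for e in exemplars]
    lines.append("Describe the attributes of the detected facility.")
    return "\n".join(lines)

# Hypothetical toy stores: (embedding, payload) pairs. Real embeddings would
# come from text and image encoders; these 3-d vectors are placeholders.
TEXT_STORE = [
    ([1.0, 0.0, 0.0], "placeholder rule A"),
    ([0.0, 1.0, 0.0], "placeholder rule B"),
]
IMAGE_STORE = [
    ([0.9, 0.1, 0.0], "exemplar_001.jpg (placeholder)"),
    ([0.0, 0.2, 0.9], "exemplar_002.jpg (placeholder)"),
]

if __name__ == "__main__":
    # The query vector stands in for the embedding of a detected object crop.
    print(build_prompt([0.9, 0.1, 0.0], TEXT_STORE, IMAGE_STORE, k=1))
```

In a real pipeline, the assembled context would be passed to the fine-tuned vision-language model together with the cropped detection. Here it is only printed.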
Related papers
- Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models [15.709482146201283]
A simple linear classifier trained on the frozen features of modern Vision Foundation Models establishes a new state-of-the-art. We show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
arXiv Detail & Related papers (2026-02-02T07:20:02Z)
- CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems [29.385460126069386]
We introduce a novel benchmark, CogRail, which integrates curated datasets with cognitively driven question-answer annotations. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models. We propose a joint fine-tuning framework that integrates three core tasks: position perception, movement prediction, and threat analysis.
arXiv Detail & Related papers (2026-01-14T16:36:26Z)
- SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection [55.54007781679915]
We propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
arXiv Detail & Related papers (2026-01-14T04:42:19Z)
- Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems [75.78934957242403]
Self-driving vehicles and drones require true Spatial Intelligence from multi-modal onboard sensor data. This paper presents a framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal.
arXiv Detail & Related papers (2025-12-30T17:58:01Z)
- ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management [9.138155308817215]
A "Pretrain-then-Reinforce" approach reconciles AI's adaptive perception with Operations Research's structural rigor. We show that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic.
arXiv Detail & Related papers (2025-12-22T03:39:43Z)
- VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization [2.6678231901651723]
This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware vulnerability detection. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with a lightweight method to extract repository-level context information. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling.
arXiv Detail & Related papers (2025-11-14T21:57:48Z)
- SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection [6.806105013817923]
SAVANT is a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection.
arXiv Detail & Related papers (2025-10-20T19:14:29Z)
- Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions [74.35421055079655]
Large language models (LLMs) have enabled the emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. Mobile Edge General Intelligence (MEGI) brings real-time, privacy-preserving reasoning to the network edge. We propose a joint optimization framework for efficient LLM reasoning deployment in MEGI.
arXiv Detail & Related papers (2025-09-27T10:53:48Z)
- NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models [72.58372335140241]
Adversarial Prompt Tuning (AdvPT) introduced learnable text prompts to enhance adversarial robustness in Vision-Language Models (VLMs). We present the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). Our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures.
arXiv Detail & Related papers (2025-06-15T03:34:23Z)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
- Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models [20.31388126105889]
DesiGNN is a knowledge-centered framework that converts past model design experiences into structured, fine-grained knowledge priors. By constructing a solid meta-knowledge between unseen graph understanding and known effective architecture patterns, DesiGNN can deliver top-5.77% initial model proposals for unseen datasets within seconds.
arXiv Detail & Related papers (2024-08-13T08:22:01Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed SUPER, to better capture the instance-specific vision-semantic characteristics. We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.