Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
- URL: http://arxiv.org/abs/2509.21259v1
- Date: Thu, 25 Sep 2025 14:53:36 GMT
- Title: Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
- Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy
- Abstract summary: Real-time urban traffic surveillance is vital for Intelligent Transportation Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle trajectories, and prevent collisions in smart cities. We propose a semantic communication framework that significantly reduces transmission overhead. This approach achieves a 99.9% reduction in data transmission size while maintaining an LLM response accuracy of 89% for reconstructed cropped images, compared to 93% accuracy with original cropped images.
- Score: 5.862522659881676
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Real-time urban traffic surveillance is vital for Intelligent Transportation Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle trajectories, and prevent collisions in smart cities. Deploying edge cameras across urban environments is a standard practice for monitoring road conditions. However, integrating these with intelligent models requires a robust understanding of dynamic traffic scenarios and a responsive interface for user interaction. Although multimodal Large Language Models (LLMs) can interpret traffic images and generate informative responses, their deployment on edge devices is infeasible due to high computational demands. Therefore, LLM inference must occur on the cloud, necessitating visual data transmission from edge to cloud, a process hindered by limited bandwidth, leading to potential delays that compromise real-time performance. To address this challenge, we propose a semantic communication framework that significantly reduces transmission overhead. Our method involves detecting Regions of Interest (RoIs) using YOLOv11, cropping relevant image segments, and converting them into compact embedding vectors using a Vision Transformer (ViT). These embeddings are then transmitted to the cloud, where an image decoder reconstructs the cropped images. The reconstructed images are processed by a multimodal LLM to generate traffic condition descriptions. This approach achieves a 99.9% reduction in data transmission size while maintaining an LLM response accuracy of 89% for reconstructed cropped images, compared to 93% accuracy with original cropped images. Our results demonstrate the efficiency and practicality of ViT and LLM-assisted edge-cloud semantic communication for real-time traffic surveillance.
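The edge-side pipeline the abstract describes (RoI detection, cropping, ViT embedding, transmission of compact vectors) can be sketched as follows. The YOLOv11 detector and ViT encoder are replaced with toy stand-ins, and all names, box coordinates, and dimensions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

EMBED_DIM = 768  # typical ViT-Base embedding width (assumption)

def detect_rois(frame):
    """Stand-in for YOLOv11: return (x, y, w, h) boxes for regions of interest."""
    h, w, _ = frame.shape
    return [(0, 0, w // 4, h // 4), (w // 2, h // 2, w // 4, h // 4)]

def vit_encode(crop):
    """Toy stand-in for a ViT encoder: pool the crop into one fixed-size vector."""
    flat = crop.reshape(-1).astype(np.float32) / 255.0
    return np.array([c.mean() for c in np.array_split(flat, EMBED_DIM)],
                    dtype=np.float32)

def edge_payload(frame):
    """Build the compact payload the edge would transmit instead of the frame."""
    return [vit_encode(frame[y:y + h, x:x + w])
            for (x, y, w, h) in detect_rois(frame)]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # one raw 1080p camera frame
payload = edge_payload(frame)

raw_bytes = frame.nbytes                            # cost of sending the full frame
sem_bytes = sum(e.nbytes for e in payload)          # cost of sending embeddings only
print(f"reduction: {100 * (1 - sem_bytes / raw_bytes):.2f}%")
```

On the cloud side, the paper's image decoder would reconstruct the cropped images from these embeddings before passing them to the multimodal LLM; that stage is omitted from this sketch.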
Related papers
- Wireless Traffic Prediction with Large Language Model [54.07581399989292]
TIDES is a novel framework that captures spatial-temporal correlations for wireless traffic prediction. TIDES achieves efficient adaptation to domain-specific patterns without incurring excessive training overhead. Our results indicate that integrating spatial awareness into LLM-based predictors is the key to unlocking scalable and intelligent network management in future 6G systems.
arXiv Detail & Related papers (2025-12-19T04:47:40Z) - TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs [8.205106134817763]
Efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
arXiv Detail & Related papers (2025-11-26T01:34:08Z) - Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making [3.0273878903284266]
This study presents an end-to-end AI-based framework for high-resolution, longitudinal analysis at scale. A fine-tuned YOLOv11 model, trained on localized urban scenes, extracts multimodal traffic density and classification metrics in real time. We validated the system using over 9 million images from roughly 1,000 traffic cameras during the early rollout of NYC congestion pricing in 2025.
arXiv Detail & Related papers (2025-10-11T03:18:42Z) - When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis [6.213279061986497]
SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach. Our framework employs a multimodal-based aggregation strategy to handle videos of various lengths and generate structured responses for review and evaluation. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
arXiv Detail & Related papers (2025-01-17T23:35:34Z) - Improving Traffic Flow Predictions with SGCN-LSTM: A Hybrid Model for Spatial and Temporal Dependencies [55.2480439325792]
This paper introduces the Signal-Enhanced Graph Convolutional Network Long Short Term Memory (SGCN-LSTM) model for predicting traffic speeds across road networks.
Experiments on the PEMS-BAY road network traffic dataset demonstrate the SGCN-LSTM model's effectiveness.
arXiv Detail & Related papers (2024-11-01T00:37:00Z) - Towards Multi-agent Reinforcement Learning based Traffic Signal Control through Spatio-temporal Hypergraphs [19.107744041461316]
Traffic signal control systems (TSCSs) are integral to intelligent traffic management, fostering efficient vehicle flow. We propose a novel TSCS framework to realize an intelligent traffic edge network. We have crafted a multi-agent soft actor-critic (MA-SAC) reinforcement learning algorithm.
arXiv Detail & Related papers (2024-04-17T02:46:18Z) - Elastic Interaction Energy-Informed Real-Time Traffic Scene Perception [8.429178814528617]
A network training strategy based on a topology-aware energy loss function, named EIEGSeg, is proposed.
EIEGSeg is designed for multi-class segmentation on real-time traffic scene perception.
Our results demonstrate that EIEGSeg consistently improves the performance, especially on real-time, lightweight networks.
arXiv Detail & Related papers (2023-10-02T01:30:42Z) - Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z) - Road Network Guided Fine-Grained Urban Traffic Flow Inference [108.64631590347352]
Accurate inference of fine-grained traffic flow from its coarse-grained counterpart is an emerging yet crucial problem.
We propose a novel Road-Aware Traffic Flow Magnifier (RATFM) that exploits the prior knowledge of road networks.
Our method can generate high-quality fine-grained traffic flow maps.
arXiv Detail & Related papers (2021-09-29T07:51:49Z) - Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [59.60483620730437]
We propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention.
Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
arXiv Detail & Related papers (2021-04-19T11:48:13Z) - Learning dynamic and hierarchical traffic spatiotemporal features with Transformer [4.506591024152763]
This paper proposes a novel model, Traffic Transformer, for spatial-temporal graph modeling and long-term traffic forecasting.
Transformer is the most popular framework in Natural Language Processing (NLP). Analyzing the attention weight matrices can reveal the influential parts of road networks, allowing us to model the traffic networks better.
arXiv Detail & Related papers (2021-04-12T02:29:58Z) - Deep traffic light detection by overlaying synthetic context on
arbitrary natural images [49.592798832978296]
We propose a method to generate artificial traffic-related training data for deep traffic light detectors.
This data is generated using basic non-realistic computer graphics to blend fake traffic scenes on top of arbitrary image backgrounds.
It also tackles the intrinsic data imbalance problem in traffic light datasets, caused mainly by the low amount of samples of the yellow state.
arXiv Detail & Related papers (2020-11-07T19:57:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.