TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs
- URL: http://arxiv.org/abs/2511.20965v1
- Date: Wed, 26 Nov 2025 01:34:08 GMT
- Title: TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs
- Authors: Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar
- Abstract summary: Efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
- Score: 8.205106134817763
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
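The following is a minimal Python sketch of the sequential, similarity-gated pipeline the abstract describes: cameras are processed in order, each description is conditioned on the previous camera's output, and a VLM call is skipped when an object-level similarity check flags the view as redundant. The function names (`run_vlm`, `object_similarity`), the token-limit schedule, and the threshold are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): sequential, similarity-gated
# video-to-text conversion over overlapping intersection cameras.
from typing import List


def run_vlm(frames, prompt: str, max_tokens: int) -> str:
    """Placeholder for a Vision-Language Model call that returns a textual
    description of the camera view, bounded by max_tokens."""
    raise NotImplementedError  # plug in the VLM of your choice


def object_similarity(frames_a, frames_b) -> float:
    """Placeholder object-level similarity between two camera views,
    e.g. overlap of object sets from a lightweight detector."""
    raise NotImplementedError


def describe_intersection(camera_feeds: List[list],
                          token_limits: List[int],
                          sim_threshold: float = 0.85) -> List[str]:
    """Process cameras sequentially: each camera's description is conditioned
    on the previous output, and cameras whose content largely overlaps with
    the previous one reuse that description instead of invoking the VLM."""
    descriptions: List[str] = []
    prev_frames, prev_text = None, ""
    for frames, max_tokens in zip(camera_feeds, token_limits):
        if (prev_frames is not None
                and object_similarity(prev_frames, frames) >= sim_threshold):
            # Redundant view: skip the expensive VLM invocation.
            descriptions.append(prev_text)
        else:
            prompt = ("Describe the traffic scene. "
                      f"Context from the previous camera: {prev_text}")
            prev_text = run_vlm(frames, prompt, max_tokens)
            descriptions.append(prev_text)
        prev_frames = frames
    return descriptions
```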
Related papers
- 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning [53.28271278708241]
We present a Detector-Empowered Video LLM, DEViL for short. DEViL couples a Video LLM with an open-vocabulary detector (OVD). Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding.
arXiv Detail & Related papers (2025-12-07T06:11:15Z) - Aligning Effective Tokens with Video Anomaly in Large Language Models [42.99603812716817]
We propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules. We construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs.
arXiv Detail & Related papers (2025-08-08T14:30:05Z) - InterAct-Video: Reasoning-Rich Video QA for Urban Traffic [21.849445040376537]
Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models. Existing VideoQA models struggle with the complexity of real-world traffic scenes. InterAct VideoQA is a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks.
arXiv Detail & Related papers (2025-07-19T20:30:43Z) - Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time.
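As a rough illustration only, the sketch below mirrors the three-layer flow summarized above (end, edge, cloud); every type and function name is a hypothetical placeholder, not the SV-FDT implementation.

```python
# Hypothetical sketch of an end -> edge -> cloud digital-twin flow.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LocalDigitalTwin:
    region_id: str
    agents: List[dict] = field(default_factory=list)  # twin agents (pedestrians/vehicles)


def end_layer_collect(region_id: str) -> List[bytes]:
    """End layer: gather surveillance video clips for one region (placeholder)."""
    return []


def edge_layer_build(region_id: str, clips: List[bytes]) -> LocalDigitalTwin:
    """Edge layer: visual understanding + agent modeling -> local digital twin (placeholder)."""
    return LocalDigitalTwin(region_id=region_id)


def cloud_layer_merge(local_twins: List[LocalDigitalTwin]) -> Dict[str, LocalDigitalTwin]:
    """Cloud layer: integrate regional twins into a global digital-twin model."""
    return {t.region_id: t for t in local_twins}


if __name__ == "__main__":
    regions = ["intersection-A", "intersection-B"]
    global_dt = cloud_layer_merge(
        [edge_layer_build(r, end_layer_collect(r)) for r in regions]
    )
    print(list(global_dt.keys()))
```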
arXiv Detail & Related papers (2025-03-06T07:36:06Z) - When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis [6.213279061986497]
SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach. Our framework employs a multimodal-based aggregation strategy to handle videos of various lengths and generate structured responses for review and evaluation. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
arXiv Detail & Related papers (2025-01-17T23:35:34Z) - VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z) - TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning [0.0]
We present TrafficVLM, a novel multi-modal dense video captioning model for the vehicle ego-camera view.
Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings.
arXiv Detail & Related papers (2024-04-14T14:51:44Z) - BjTT: A Large-scale Multimodal Dataset for Traffic Prediction [49.93028461584377]
Traditional traffic prediction methods rely on historical traffic data to predict traffic trends.
In this work, we explore how generative models combined with text describing the traffic system can be applied to traffic generation.
We propose ChatTraffic, the first diffusion model for text-to-traffic generation.
arXiv Detail & Related papers (2024-03-08T04:19:56Z) - Traffic-Domain Video Question Answering with Automatic Captioning [69.98381847388553]
Video Question Answering (VidQA) exhibits remarkable potential in facilitating advanced machine reasoning capabilities.
We present a novel approach termed Traffic-domain Video Question Answering with Automatic Captioning (TRIVIA), which serves as a weak-supervision technique for infusing traffic-domain knowledge into large video-language models.
arXiv Detail & Related papers (2023-07-18T20:56:41Z) - Traffic Scene Parsing through the TSP6K Dataset [109.69836680564616]
We introduce a specialized traffic monitoring dataset, termed TSP6K, with high-quality pixel-level and instance-level annotations.
The dataset captures more crowded traffic scenes, with several times more traffic participants than existing driving scenes.
We propose a detail refining decoder for scene parsing, which recovers the details of different semantic regions in traffic scenes.
arXiv Detail & Related papers (2023-03-06T02:05:14Z) - A novel efficient Multi-view traffic-related object detection framework [17.50049841016045]
We propose a novel traffic-related framework named CEVAS to achieve efficient object detection using multi-view video data.
Results show that our framework significantly reduces response latency while achieving the same detection accuracy as the state-of-the-art methods.
arXiv Detail & Related papers (2023-02-23T06:42:37Z) - Scalable and Real-time Multi-Camera Vehicle Detection, Re-Identification, and Tracking [58.95210121654722]
We propose a real-time city-scale multi-camera vehicle tracking system that handles real-world, low-resolution CCTV instead of idealized and curated video streams.
Our method is ranked among the top five performers on the public leaderboard.
arXiv Detail & Related papers (2022-04-15T12:47:01Z) - Edge Computing for Real-Time Near-Crash Detection for Smart Transportation Applications [29.550609157368466]
Traffic near-crash events serve as critical data sources for various smart transportation applications.
This paper leverages the power of edge computing to address these challenges by processing video streams from existing onboard dashcams in real time.
It is among the first efforts in applying edge computing for real-time traffic video analytics and is expected to benefit multiple sub-fields in smart transportation research and applications.
arXiv Detail & Related papers (2020-08-02T19:39:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.