Related papers: STAC: Leveraging Spatio-Temporal Data Associations For Efficient Cross-Camera Streaming and Analytics

URL: http://arxiv.org/abs/2401.15288v2
Date: Wed, 13 Aug 2025 15:28:59 GMT
Title: STAC: Leveraging Spatio-Temporal Data Associations For Efficient Cross-Camera Streaming and Analytics
Authors: Ragini Gupta, Lingzhi Zhao, Jiaxi Li, Volodymyr Vakhniuk, Claudiu Danilov, Josh Eckhardt, Keyshla Bernard, Klara Nahrstedt
Abstract summary: In a distributed network of cameras, real-time multi-camera video analytics is challenged by high bandwidth demands and redundant visual data. We present STAC, a cross-camera surveillance system that leverages spatio-temporal associations for efficient object tracking under constrained network conditions.
Score: 5.752749052742801
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In an IoT-based distributed network of cameras, real-time multi-camera video analytics is challenged by high bandwidth demands and redundant visual data, creating a fundamental tension where reducing data saves network overhead but can degrade model performance, and vice versa. We present STAC, a cross-camera surveillance system that leverages spatio-temporal associations for efficient object tracking under constrained network conditions. STAC integrates multi-resolution feature learning, ensuring robustness under variable networked system-level optimizations such as frame filtering, FFmpeg-based compression, and Region-of-Interest (RoI) masking, to eliminate redundant content across distributed video streams while preserving downstream model accuracy for object identification and tracking. Evaluated on NVIDIA's AICity Challenge dataset, STAC achieves a 76% improvement in tracking accuracy and an 8.6x reduction in inference latency over a standard multi-object multi-camera tracking baseline (using YOLOv4 and DeepSORT). Furthermore, 29% of redundant frames are filtered, significantly reducing data volume without compromising inference quality.

Related papers

A Secure and Private Distributed Bayesian Federated Learning Design [56.92336577799572]
Distributed Federated Learning (DFL) enables decentralized model training across large-scale systems without a central parameter server. DFL faces three critical challenges: privacy leakage from honest-but-curious neighbors, slow convergence due to the lack of central coordination, and vulnerability to Byzantine adversaries aiming to degrade model accuracy. We propose a novel DFL framework that integrates Byzantine robustness, privacy preservation, and convergence acceleration.
arXiv Detail & Related papers (2026-02-23T16:12:02Z)
Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection? [57.000348519630286]
Recent advances in mobile edge computing have made it possible to offload intensive object detection to edge servers equipped with high-accuracy neural networks. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. For the single-device setting, we propose LTED-Ada, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection.
arXiv Detail & Related papers (2025-11-25T04:54:51Z)
TRACER: Efficient Object Re-Identification in Networked Cameras through Adaptive Query Processing [8.955401552705892]
Spatula is the state-of-the-art video database management system (VDBMS) for processing Re-ID queries. It is not suitable for critical video analytics applications that require high recall due to camera history. We present Tracer, a novel VDBMS for efficiently processing Re-ID queries using an adaptive query processing framework.
arXiv Detail & Related papers (2025-07-13T02:22:08Z)
Deep Learning and Hybrid Approaches for Dynamic Scene Analysis, Object Detection and Motion Tracking [0.0]
This project aims to develop a robust video surveillance system, which can segment videos into smaller clips based on the detection of activities. It uses CCTV footage, for example, to record only major events, such as the appearance of a person or a thief, so that storage is optimized and digital searches are easier.
arXiv Detail & Related papers (2024-12-05T07:44:40Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Accelerated Event-Based Feature Detection and Compression for Surveillance Video Systems [1.5390526524075634]
We propose a novel system which conveys temporal redundancy within a sparse decompressed representation. We leverage a video representation framework called ADDER to transcode framed videos to sparse, asynchronous intensity samples. Our work paves the way for upcoming neuromorphic sensors and is amenable to future applications with spiking neural networks.
arXiv Detail & Related papers (2023-12-13T15:30:29Z)
Learn to Compress (LtC): Efficient Learning-based Streaming Video Analytics [3.2872586139884623]
LtC is a collaborative framework between the video source and the analytics server that efficiently learns to reduce the video streams within an analytics pipeline. LtC is able to use 28-35% less bandwidth and has up to 45% shorter response delay compared to recently published state of the art streaming frameworks.
arXiv Detail & Related papers (2023-07-22T21:36:03Z)
Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and pixels in each frame. We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct missing information. Experimental results show that ViT_STAE can compress the video dataset H51 by 104x with only 5% accuracy loss.
arXiv Detail & Related papers (2023-05-22T07:47:27Z)
GPU-accelerated SIFT-aided source identification of stabilized videos [63.084540168532065]
We exploit the parallelization capabilities of Graphics Processing Units (GPUs) in the framework of stabilised frames inversion. We propose to exploit SIFT features to estimate the camera momentum and to identify less stabilized temporal segments. Experiments confirm the effectiveness of the proposed approach in reducing the required computational time and improving the source identification accuracy.
arXiv Detail & Related papers (2022-07-29T07:01:31Z)
FrameHopper: Selective Processing of Video Frames in Detection-driven Real-Time Video Analytics [2.5119455331413376]
Detection-driven real-time video analytics require continuous detection of objects contained in the video frames. Running these detectors on each and every frame in resource-constrained edge devices is computationally intensive. We propose an off-line Reinforcement Learning (RL)-based algorithm to determine these skip-lengths.
arXiv Detail & Related papers (2022-03-22T07:05:57Z)
Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework. It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
CANS: Communication Limited Camera Network Self-Configuration for Intelligent Industrial Surveillance [8.360870648463653]
Real-time and intelligent video surveillance via camera networks involves computation-intensive vision detection tasks with massive video data. Multiple video streams compete for limited communication resources on the link between edge devices and camera networks. An adaptive camera network self-configuration method (CANS) of video surveillance is proposed to cope with multiple video streams of heterogeneous quality of service.
arXiv Detail & Related papers (2021-09-13T01:54:33Z)
Energy-Efficient Model Compression and Splitting for Collaborative Inference Over Time-Varying Channels [52.60092598312894]
We propose a technique to reduce the total energy bill at the edge device by utilizing model compression and time-varying model split between the edge and remote nodes. Our proposed solution results in minimal energy consumption and $CO_2$ emission compared to the considered baselines.
arXiv Detail & Related papers (2021-06-02T07:36:27Z)
Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming [61.145467627057194]
We develop a new method called Face Pixelation in Video Live Streaming to generate automatic personal privacy filtering. For fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure of two core stages. On the video live streaming dataset we collected, FPVLS obtains satisfying accuracy, real-time efficiency, and contains the over-pixelation problems.
arXiv Detail & Related papers (2021-01-04T16:18:26Z)
Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features. The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS). We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance. We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
Single Shot Video Object Detector [215.06904478667337]
Single Shot Video Object Detector (SSVD) is a new architecture that novelly integrates feature aggregation into a one-stage detector for object detection in videos. For $448 \times 448$ input, SSVD achieves 79.2% mAP on ImageNet VID dataset.
arXiv Detail & Related papers (2020-07-07T15:36:26Z)
CONVINCE: Collaborative Cross-Camera Video Analytics at the Edge [1.5469452301122173]
This paper introduces CONVINCE, a new approach to look at cameras as a collective entity that enables collaborative video analytics pipeline among cameras. Our results demonstrate that CONVINCE achieves an object identification accuracy of $\sim$91%, by transmitting only about $\sim$25% of all the recorded frames.
arXiv Detail & Related papers (2020-02-05T23:55:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.