Simplifying Traffic Anomaly Detection with Video Foundation Models
- URL: http://arxiv.org/abs/2507.09338v1
- Date: Sat, 12 Jul 2025 16:36:49 GMT
- Title: Simplifying Traffic Anomaly Detection with Video Foundation Models
- Authors: Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman
- Abstract summary: Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. We investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance.
- Score: 1.0999592665107416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.
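To make the encoder-only design concrete, here is a minimal PyTorch sketch. It is not the authors' released code; the module names, sizes, and hyperparameters are illustrative assumptions. It shows the shape of the approach: a tubelet embedding, a plain Transformer encoder, mean pooling, and a linear head scoring each clip as normal or anomalous.

```python
import torch
import torch.nn as nn

class SimpleVideoViTForTAD(nn.Module):
    """Minimal encoder-only video classifier in the spirit of the paper:
    tubelet embedding -> plain Transformer encoder -> mean pool -> linear head."""

    def __init__(self, img_size=224, patch=16, frames=16, tubelet=2,
                 dim=384, depth=6, heads=6, num_classes=2):
        super().__init__()
        # 3D "tubelet" embedding: each token covers tubelet x patch x patch voxels.
        self.embed = nn.Conv3d(3, dim,
                               kernel_size=(tubelet, patch, patch),
                               stride=(tubelet, patch, patch))
        n_tokens = (frames // tubelet) * (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # normal vs. anomalous

    def forward(self, video):          # video: (B, 3, T, H, W)
        x = self.embed(video)          # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # mean-pooled clip logits

clip = torch.randn(2, 3, 16, 224, 224)
logits = SimpleVideoViTForTAD()(clip)  # (2, 2)
```

In the paper's setting, such an encoder would be initialized from self-supervised MVM pre-training, optionally followed by DAPT on unlabeled driving videos, before fine-tuning for TAD.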
Related papers
- Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation [18.402668470092294]
Synthetic video generation can produce very realistic high-resolution videos that are virtually indistinguishable from real ones. Several video forensic detectors have been recently proposed, but they often exhibit poor generalization. We introduce a novel data augmentation strategy based on wavelet decomposition and replace specific frequency-related bands to drive the model to exploit more relevant forensic cues. Our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models.
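The frequency-band replacement can be pictured with a small PyWavelets sketch. This illustrates the general idea (swapping selected wavelet sub-bands between images) rather than the paper's exact recipe; the band choices and the `swap_wavelet_bands` helper are assumptions.

```python
import numpy as np
import pywt

def swap_wavelet_bands(img, donor, wavelet="haar", bands=("cH", "cD")):
    """Decompose both grayscale images with a 2D DWT, replace the chosen
    detail sub-bands of `img` with the donor's, then reconstruct."""
    cA, (cH, cV, cD) = pywt.dwt2(img, wavelet)
    dA, (dH, dV, dD) = pywt.dwt2(donor, wavelet)
    if "cH" in bands: cH = dH
    if "cV" in bands: cV = dV
    if "cD" in bands: cD = dD
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)

real = np.random.rand(256, 256)
fake = np.random.rand(256, 256)
# Fake content with real high-frequency statistics: the detector can no
# longer rely on those bands as a shortcut.
augmented = swap_wavelet_bands(fake, real)
```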
arXiv Detail & Related papers (2025-06-20T07:36:59Z)
- Scaling Laws for Native Multimodal Models [53.490942903659565]
We revisit the architectural design of native multimodal models and conduct an extensive scaling laws study. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones. We show that incorporating Mixture of Experts (MoEs) allows for models that learn modality-specific weights, significantly enhancing performance.
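The MoE finding can be illustrated with a toy token-level mixture of experts, where a learned router lets tokens from different modalities gravitate to different expert weights. This is a generic sketch, not the paper's architecture; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Token-level mixture of experts: a router computes soft gates per
    token, so image and text tokens can use different expert weights."""

    def __init__(self, dim=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (B, N, dim)
        gates = self.router(x).softmax(dim=-1)  # (B, N, E)
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return torch.einsum("bnde,bne->bnd", out, gates)

tokens = torch.randn(2, 10, 256)  # e.g., mixed image+text tokens
y = TinyMoE()(tokens)             # (2, 10, 256)
```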
arXiv Detail & Related papers (2025-04-10T17:57:28Z) - COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition [3.271109623410664]
We propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. Our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications.
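A minimal sketch of the cross-modal distillation idea follows. The stand-in encoders and the batch-level similarity-matching loss are illustrative assumptions, not the paper's exact objective; the point is that a frozen video encoder provides targets and only the IMU student is trained, with no labels involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical encoders; the real framework distills a pretrained
# video model into a lightweight IMU model.
video_teacher = nn.Linear(512, 128)   # stand-in for a frozen video encoder
imu_student = nn.GRU(6, 128, batch_first=True)

def distill_step(video_feat, imu_seq, temp=0.1):
    with torch.no_grad():                      # teacher is frozen
        t = F.normalize(video_teacher(video_feat), dim=-1)
    _, h = imu_student(imu_seq)                # h: (1, B, 128)
    s = F.normalize(h.squeeze(0), dim=-1)
    # Match each student embedding's similarity distribution over the
    # batch to the teacher's (a soft, label-free alignment objective).
    t_sim = (t @ t.T / temp).softmax(dim=-1)
    s_sim = (s @ t.T / temp).log_softmax(dim=-1)
    return F.kl_div(s_sim, t_sim, reduction="batchmean")

loss = distill_step(torch.randn(8, 512), torch.randn(8, 50, 6))
loss.backward()  # only the IMU student receives gradients
```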
arXiv Detail & Related papers (2025-03-10T12:43:51Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
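The summary does not spell out RTD's mechanism, so the sketch below is a loudly hypothetical illustration of one generic way stored references can steer decoding without fine-tuning, in the spirit of retrieval-interpolated (kNN-LM-style) generation. Every function and parameter here is an assumption.

```python
import torch
import torch.nn.functional as F

def reference_guided_logits(model_logits, hidden, ref_keys, ref_next_tokens,
                            vocab_size, k=4, alpha=0.3, temp=1.0):
    """Blend the model's next-token distribution with one induced by the
    nearest reference contexts (hypothetical mechanism, not RTD itself)."""
    dists = torch.cdist(hidden.unsqueeze(0), ref_keys).squeeze(0)  # (R,)
    nn_idx = dists.topk(k, largest=False).indices                  # (k,)
    ref_dist = torch.zeros(vocab_size)
    for i in nn_idx:                      # vote for each neighbor's next token
        ref_dist[ref_next_tokens[i]] += 1.0 / k
    model_dist = F.softmax(model_logits / temp, dim=-1)
    return ((1 - alpha) * model_dist + alpha * ref_dist).log()

logits = torch.randn(1000)                 # toy vocabulary of 1000 tokens
hidden = torch.randn(64)                   # current decoder state
keys = torch.randn(32, 64)                 # embedded reference contexts
next_toks = torch.randint(0, 1000, (32,))  # their observed next tokens
blended = reference_guided_logits(logits, hidden, keys, next_toks, 1000)
```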
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single-Transformer architecture like SOLO effectively addresses scalability concerns in large vision-language models (LVLMs). In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
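The single-transformer idea can be sketched compactly: raw image patches are linearly projected into the same token space as text embeddings, and the concatenated sequence is processed by one Transformer with no separate vision encoder. This is a toy sketch under assumed sizes, not the SOLO training recipe; positional embeddings and the language-modeling objective are omitted for brevity.

```python
import torch
import torch.nn as nn

class UnifiedVLTransformer(nn.Module):
    """One transformer consumes linearly-projected image patches and text
    embeddings as a single token sequence -- no separate vision encoder."""

    def __init__(self, vocab=32000, dim=512, patch=16, depth=4, heads=8):
        super().__init__()
        self.patch_proj = nn.Linear(3 * patch * patch, dim)  # raw pixels -> tokens
        self.tok_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.lm_head = nn.Linear(dim, vocab)
        self.patch = patch

    def forward(self, image, text_ids):        # image: (B, 3, H, W)
        p = self.patch
        B, C, H, W = image.shape
        patches = image.unfold(2, p, p).unfold(3, p, p)    # (B,3,H/p,W/p,p,p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        seq = torch.cat([self.patch_proj(patches),
                         self.tok_embed(text_ids)], dim=1)  # one sequence
        return self.lm_head(self.blocks(seq))

model = UnifiedVLTransformer()
out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
```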
arXiv Detail & Related papers (2024-07-08T22:40:15Z) - Self-Supervised Learning with Generative Adversarial Networks for Electron Microscopy [0.0]
We show how self-supervised pretraining facilitates efficient fine-tuning for a spectrum of downstream tasks in electron microscopy, demonstrating the versatility of the approach.
arXiv Detail & Related papers (2024-02-28T12:25:01Z) - Minimalist and High-Performance Semantic Segmentation with Plain Vision
Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which allows for the utilization of hierarchical features.
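The minimalist design is concrete enough to sketch: plain ViT tokens are reshaped to a 2D map, upsampled, and passed through only three 3x3 convolutions to produce class logits. The shapes and the `PlainSegHead` wiring below are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainSegHead(nn.Module):
    """Sketch of a minimalist head: upsample plain ViT tokens to a feature
    map, then apply only three 3x3 convolutions to get class logits."""

    def __init__(self, dim=384, hidden=256, num_classes=19):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.conv3 = nn.Conv2d(hidden, num_classes, 3, padding=1)

    def forward(self, tokens, hw, out_size):
        B, N, C = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, C, *hw)   # tokens -> 2D map
        # Simple bilinear upsampling to a higher resolution before the
        # three convolutions.
        x = F.interpolate(x, size=out_size, mode="bilinear",
                          align_corners=False)
        x = F.relu(self.conv2(F.relu(self.conv1(x))))
        return self.conv3(x)

vit_tokens = torch.randn(1, 196, 384)                   # e.g., 14x14 patch tokens
logits = PlainSegHead()(vit_tokens, (14, 14), (112, 112))
```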
arXiv Detail & Related papers (2023-10-19T14:01:40Z) - TaCA: Upgrading Your Visual Foundation Model with Task-agnostic
Compatible Adapter [21.41170708560114]
A growing number of applications based on visual foundation models are emerging. When the foundation model is upgraded, all downstream modules must normally be re-trained to adapt to it.
We introduce a parameter-efficient and task-agnostic adapter, dubbed TaCA, that facilitates compatibility across distinct foundation models.
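One way to picture such a compatibility adapter is sketched below. The stand-in encoders, dimensions, and cosine alignment loss are illustrative assumptions, not the paper's training objective: a small trainable module maps new-model features into the old embedding space while both foundation models stay frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: a new foundation model (1024-d) must stay
# compatible with downstream modules tuned to the old 768-d space.
old_encoder = nn.Linear(3 * 224 * 224, 768)   # frozen stand-in, old model
new_encoder = nn.Linear(3 * 224 * 224, 1024)  # frozen stand-in, new model

class CompatAdapter(nn.Module):
    """Small trainable adapter that maps new-model features into the old
    model's embedding space, so downstream heads need no retraining."""
    def __init__(self, new_dim=1024, old_dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(new_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, old_dim))
    def forward(self, x):
        return self.net(x)

adapter = CompatAdapter()
x = torch.randn(4, 3 * 224 * 224)
with torch.no_grad():
    old_feat = old_encoder(x)
    new_feat = new_encoder(x)
# Align adapted new features with old ones (one possible training signal).
loss = 1 - F.cosine_similarity(adapter(new_feat), old_feat).mean()
loss.backward()  # only adapter parameters receive gradients
```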
arXiv Detail & Related papers (2023-06-22T03:00:24Z) - Unbiased Learning of Deep Generative Models with Structured Discrete
Representations [7.9057320008285945]
We propose novel algorithms for learning structured variational autoencoders (SVAEs).
We are the first to demonstrate the SVAE's ability to handle multimodal uncertainty when data is missing by incorporating discrete latent variables.
Our memory-efficient implicit differentiation scheme makes the SVAE tractable to learn via gradient descent, while demonstrating robustness to incomplete optimization.
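As a generic illustration of training with discrete latent variables, the sketch below uses a Gumbel-softmax relaxation. This is explicitly a stand-in, not the paper's structured VAE or its implicit-differentiation scheme; all architecture choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteLatentVAE(nn.Module):
    """Generic discrete-latent VAE using Gumbel-softmax -- a stand-in
    illustration, not the SVAE algorithm from the paper."""

    def __init__(self, x_dim=20, n_cats=8, h=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(),
                                 nn.Linear(h, n_cats))
        self.dec = nn.Sequential(nn.Linear(n_cats, h), nn.ReLU(),
                                 nn.Linear(h, x_dim))
        self.n_cats = n_cats

    def forward(self, x, tau=0.5):
        logits = self.enc(x)
        z = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot discrete code
        recon = self.dec(z)
        q = logits.softmax(-1)
        # KL to a uniform categorical prior over the discrete codes.
        kl = (q * (q.clamp_min(1e-8).log()
                   + torch.log(torch.tensor(float(self.n_cats))))).sum(-1).mean()
        return F.mse_loss(recon, x) + kl

model = DiscreteLatentVAE()
loss = model(torch.randn(16, 20))
loss.backward()
```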
arXiv Detail & Related papers (2023-06-14T03:59:21Z) - Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- SAM-guidEd refinEment Module (SEEM).
This light-weight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
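A sketch of the plug-in idea: super-resolution features attend to semantic tokens (e.g., embeddings derived from SAM masks) via cross-attention, with a residual connection keeping the module light and safe to insert into an existing network. The wiring and shapes are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class SEEMLikeRefiner(nn.Module):
    """Sketch of a plug-in refinement idea: video-SR features attend to
    semantic embeddings via cross-attention (shapes are assumptions)."""

    def __init__(self, dim=64, sem_dim=256, heads=4):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, feat, sem_tokens):       # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        q = feat.flatten(2).transpose(1, 2)    # (B, HW, C)
        kv = self.sem_proj(sem_tokens)         # (B, S, C)
        refined, _ = self.attn(q, kv, kv)
        # Residual connection keeps the plug-in light and non-destructive.
        out = q + self.out(refined)
        return out.transpose(1, 2).reshape(B, C, H, W)

feat = torch.randn(1, 64, 32, 32)              # intermediate VSR features
sem = torch.randn(1, 16, 256)                  # semantic tokens, e.g., from SAM
refined = SEEMLikeRefiner()(feat, sem)
```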
arXiv Detail & Related papers (2023-05-11T02:02:53Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
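The summary describes the recipe concretely enough to sketch: freeze the language model, train a single linear projection for the perceptual features, and prepend one trainable soft token. The stand-in modules and sizes below are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative recipe, following the summary: freeze the LM, train only a
# linear projection of perceptual features plus one prepended soft token.
lm_dim, vis_dim, vocab = 512, 768, 32000

lm = nn.TransformerEncoder(                     # stand-in for a frozen LM
    nn.TransformerEncoderLayer(lm_dim, 8, 4 * lm_dim, batch_first=True), 2)
tok_embed = nn.Embedding(vocab, lm_dim)
for p in list(lm.parameters()) + list(tok_embed.parameters()):
    p.requires_grad = False                     # >99% of parameters frozen

proj = nn.Linear(vis_dim, lm_dim)               # trainable: one projection...
soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # ...and one soft token

def forward(visual_feat, text_ids):
    v = proj(visual_feat).unsqueeze(1)          # (B, 1, lm_dim)
    t = tok_embed(text_ids)                     # (B, L, lm_dim)
    prefix = soft_token.expand(t.size(0), -1, -1)
    return lm(torch.cat([prefix, v, t], dim=1))

out = forward(torch.randn(2, 768), torch.randint(0, vocab, (2, 10)))
print(sum(p.numel() for p in [*proj.parameters(), soft_token]),
      "trainable parameters")
```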
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation [99.88539409432916]
We study the unsupervised domain adaptation (UDA) process.
We propose a novel UDA method, DAFormer, based on the benchmark results.
DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes.
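The core self-training loop behind this family of UDA methods can be sketched briefly. This is a generic illustration of EMA-teacher pseudo-labeling, the common backbone of such approaches, not DAFormer's full recipe (which adds components such as rare-class sampling); the toy network and shapes are assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Conv2d(3, 19, 1)                # stand-in segmentation network
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

def uda_step(src_img, src_label, tgt_img, alpha=0.999, thresh=0.9):
    # Supervised loss on labeled source-domain images.
    loss = F.cross_entropy(student(src_img), src_label)
    # Pseudo-labels on unlabeled target images from the EMA teacher,
    # kept only where the teacher is confident.
    with torch.no_grad():
        probs = teacher(tgt_img).softmax(dim=1)
        conf, pseudo = probs.max(dim=1)
    tgt_loss = F.cross_entropy(student(tgt_img), pseudo, reduction="none")
    loss = loss + (tgt_loss * (conf > thresh)).mean()
    loss.backward()
    # EMA update of the teacher (normally after the optimizer step).
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(alpha).add_(sp, alpha=1 - alpha)
    return loss

loss = uda_step(torch.randn(2, 3, 64, 64),
                torch.randint(0, 19, (2, 64, 64)),
                torch.randn(2, 3, 64, 64))
```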
arXiv Detail & Related papers (2021-11-29T19:00:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.