MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation
- URL: http://arxiv.org/abs/2509.19227v1
- Date: Tue, 23 Sep 2025 16:49:25 GMT
- Title: MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation
- Authors: Tongshuai Wu, Chao Lu, Ze Song, Yunlong Lin, Sizhe Fan, Xuemei Chen
- Abstract summary: A Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction.
- Score: 11.143415608240057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing the complex, asynchronous multi-temporal behavioral cues that precede accidents. To address these challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing, and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term, and long-term temporal scales, and a Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on the DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.
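As a concrete illustration of the three-stage design described above, here is a minimal PyTorch sketch of a multi-scale feature interaction pipeline. The window lengths, layer sizes, the use of a GRU for the causal temporal stage, and concatenation-based post fusion are all assumptions made for illustration; the paper's actual modules, object-level features, and training objective are not reproduced here.

```python
import torch
import torch.nn as nn

class MsFINSketch(nn.Module):
    """Illustrative three-stage pipeline: per-scale feature interaction,
    causal temporal processing, and post fusion into one risk score."""

    def __init__(self, feat_dim=256, n_heads=4, scales=(5, 15, 30)):
        super().__init__()
        self.scales = scales  # short-, mid-, long-term windows in frames (assumed)
        # Stage 1: one Transformer encoder per temporal scale for feature interaction
        self.scale_encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True),
                num_layers=1)
            for _ in scales])
        # Stage 2: causal temporal processing (a GRU only looks backwards in time)
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Stage 3: post fusion of the per-scale summaries into a risk representation
        self.fuse = nn.Linear(feat_dim * len(scales), feat_dim)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frames):                  # frames: (B, T, feat_dim)
        summaries = []
        for win, encoder in zip(self.scales, self.scale_encoders):
            clip = frames[:, -win:]             # most recent `win` frames
            interacted = encoder(clip)          # self-attention across the window
            seq, _ = self.temporal(interacted)  # causal sequential evolution
            summaries.append(seq[:, -1])        # last hidden state summarizes the scale
        risk = self.fuse(torch.cat(summaries, dim=-1))
        return torch.sigmoid(self.head(risk))   # probability of an imminent accident

# Usage: 30 frames of 256-d per-frame features for a batch of 2 videos
model = MsFINSketch()
prob = model(torch.randn(2, 30, 256))           # shape (2, 1)
```

The design point mirrored here is that each temporal scale is encoded and summarized independently before fusion, so short-horizon cues are not washed out by long-horizon context.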
Related papers
- Model Merging in the Essential Subspace [78.5390284258307]
Model merging aims to integrate multiple task-specific fine-tuned models into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. We propose ESM (Essential Subspace Merging), a robust framework for effective model merging.
arXiv Detail & Related papers (2026-02-23T00:33:38Z)
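The summary above does not describe how ESM constructs its essential subspace, but the general model-merging setup it addresses can be sketched with plain task arithmetic. The sketch below is a generic baseline under that assumption, not ESM's method; `merge_task_models`, the averaging rule, and the toy checkpoints are all illustrative.

```python
import torch

def merge_task_models(base_state, task_states, alpha=0.5):
    """Merge task-specific fine-tuned checkpoints into one multi-task model
    by averaging their task vectors (deltas from the shared base model).
    Generic task-arithmetic baseline, hypothetical with respect to ESM."""
    merged = {}
    for name, base_w in base_state.items():
        # Task vector = fine-tuned weights minus base weights.
        deltas = [ts[name] - base_w for ts in task_states]
        merged[name] = base_w + alpha * torch.stack(deltas).mean(dim=0)
    return merged

# Usage with toy single-layer checkpoints: opposing deltas interfere and cancel,
# which is exactly the task-interference problem the entry above targets.
base = {"fc.weight": torch.zeros(4, 4)}
tasks = [{"fc.weight": torch.ones(4, 4)}, {"fc.weight": -torch.ones(4, 4)}]
merged = merge_task_models(base, tasks)  # deltas cancel -> stays at the base weights
```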
- Diffusion Forcing for Multi-Agent Interaction Sequence Modeling [52.769202433667125]
MAGNet is a unified autoregressive diffusion framework for multi-agent motion generation. It supports a wide range of interaction tasks through flexible conditioning and sampling. It captures both tightly synchronized activities and loosely structured social interactions.
arXiv Detail & Related papers (2025-12-19T18:59:02Z)
- MGTS-Net: Exploring Graph-Enhanced Multimodal Fusion for Augmented Time Series Forecasting [1.7077661158850292]
We propose MGTS-Net, a Multimodal Graph-enhanced Network for Time Series forecasting. The model consists of three core components: (1) a Multimodal Feature Extraction layer (MFE), (2) a Multimodal Feature Fusion layer (MFF), and (3) a Multi-Scale Prediction layer (MSP).
arXiv Detail & Related papers (2025-10-18T04:47:10Z)
- Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment [5.262258418692889]
Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. The Long-term Multimodal Attention Consistency Network (LMAC-Net) introduces a multimodal attention consistency mechanism to explicitly align multimodal features. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods.
arXiv Detail & Related papers (2025-07-29T15:58:39Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. Its design innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
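MMD-LoRA builds on the standard LoRA mechanism of learning low-rank updates on top of frozen pretrained weights. Below is a minimal sketch of that generic mechanism only; the class name, rank, and scaling factor are illustrative assumptions, and MMD-LoRA's PDDA and VTCCL components are not reproduced.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: freeze a pretrained linear layer and learn a
    low-rank update, giving an effective weight W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.scale = alpha / r
        # A is small random, B is zero, so training starts from the base model.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        # Frozen path plus a trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: adapt a frozen 512->512 projection with only the r*(in+out) LoRA weights trainable.
layer = LoRALinear(nn.Linear(512, 512), r=8)
y = layer(torch.randn(2, 512))
```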
- Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction [23.72022120344089]
Motion prediction plays an essential role in autonomous driving systems.
We propose a novel end-to-end framework for incomplete vehicle trajectory prediction.
We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios.
arXiv Detail & Related papers (2024-09-02T02:36:18Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
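The distinguishing idea in the DeepInteraction++ entry above is that each modality keeps its own representation stream and the streams exchange information, rather than being collapsed into one fused tensor. The sketch below illustrates that pattern with a pair of cross-attention blocks; the dimensions, token counts, and single-block depth are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Two modality streams query each other via cross-attention while each
    keeps its own representation (a sketch of representational interaction)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cam_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_cam = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam_tokens, lidar_tokens):
        # Each stream attends to the other; residuals preserve per-modality features.
        cam_out, _ = self.cam_from_lidar(cam_tokens, lidar_tokens, lidar_tokens)
        lidar_out, _ = self.lidar_from_cam(lidar_tokens, cam_tokens, cam_tokens)
        return cam_tokens + cam_out, lidar_tokens + lidar_out

# Usage: 100 camera tokens and 200 LiDAR tokens, both 256-d, for a batch of 2.
block = CrossModalInteraction()
cam, lidar = block(torch.randn(2, 100, 256), torch.randn(2, 200, 256))
```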
- CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions [13.981748780317329]
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs).
This study introduces a novel accident anticipation framework for AVs, termed CRASH.
It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion.
Our model surpasses existing top baselines in critical evaluation metrics like Average Precision (AP) and mean Time-To-Accident (mTTA).
arXiv Detail & Related papers (2024-07-25T04:12:49Z)
- AccidentBlip: Agent of Accident Warning based on MA-former [24.81148840857782]
AccidentBlip is a vision-only framework that employs our self-designed Motion Accident Transformer (MA-former) to process each frame of video. AccidentBlip achieves strong performance in both accident detection and prediction tasks on the DeepAccident dataset. It also outperforms current SOTA methods in V2V and V2X scenarios, demonstrating a superior capability to understand complex real-world environments.
arXiv Detail & Related papers (2024-04-18T12:54:25Z)
- Scalable Multi-modal Model Predictive Control via Duality-based Interaction Predictions [8.256630421682951]
RAID-Net is a novel attention-based Recurrent Neural Network that predicts relevant interactions along the Model Predictive Control (MPC) prediction horizon. Our approach is demonstrated in a simulated traffic scenario with interactive surrounding vehicles, showcasing a 12x speed-up in solving the motion planning problem.
arXiv Detail & Related papers (2024-02-02T03:19:54Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
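The MITNet entry above describes a two-stage pipeline that first restores the Fourier amplitude spectrum and then refines the phase spectrum. A minimal PyTorch sketch of the underlying decomposition and recombination scaffolding follows; the restoration networks themselves are omitted, and the function names are illustrative assumptions.

```python
import torch

def split_spectrum(img):
    """Decompose an image into Fourier amplitude and phase, the two quantities
    the dehazing stages operate on (one component per stage)."""
    spec = torch.fft.fft2(img)       # complex spectrum per channel
    return spec.abs(), spec.angle()  # amplitude, phase

def recombine(amplitude, phase):
    # Rebuild the image from a (restored) amplitude and a (refined) phase.
    spec = torch.polar(amplitude, phase)
    return torch.fft.ifft2(spec).real

# Round trip on a toy batch: recombining the unmodified amplitude and phase
# reproduces the input, so each stage can edit one component in isolation.
x = torch.rand(1, 3, 64, 64)
amp, pha = split_spectrum(x)
assert torch.allclose(recombine(amp, pha), x, atol=1e-5)
```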
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.