Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos
- URL: http://arxiv.org/abs/2510.12190v1
- Date: Tue, 14 Oct 2025 06:36:41 GMT
- Title: Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos
- Authors: Shingo Yokoi, Kento Sasaki, Yu Yamaguchi
- Abstract summary: We present a hierarchical reasoning framework for incident report generation from dashcam videos. We integrate frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score.
- Score: 0.03598453624340711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.
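The three stages named in the abstract (frame-level captioning, incident frame detection, fine-grained reasoning) and the selection step can be pictured with a minimal Python sketch. This is not the authors' implementation: every name below (`generate_report`, `blind_ab_select`, the stub functions) is hypothetical, the stubs stand in for real VLM calls, and reading "Blind A/B Scoring" as a pairwise comparison with candidate order shuffled is an assumption, since the abstract does not specify the protocol.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image; a real system would pass pixel data


@dataclass
class IncidentReport:
    incident_frame: int
    captions: List[str]
    narrative: str


def generate_report(
    frames: Sequence[Frame],
    caption_fn: Callable[[Frame], str],
    detect_fn: Callable[[List[str]], int],
    reason_fn: Callable[[Frame, List[str], int], str],
) -> IncidentReport:
    """Run the three stages in order: caption, detect, reason."""
    captions = [caption_fn(f) for f in frames]          # stage 1: per-frame captions
    idx = detect_fn(captions)                           # stage 2: pick the incident frame
    narrative = reason_fn(frames[idx], captions, idx)   # stage 3: focused reasoning
    return IncidentReport(idx, captions, narrative)


def blind_ab_select(
    candidates: Sequence[str],
    judge_fn: Callable[[str, str], int],
    rng: random.Random,
) -> str:
    """Keep the winner of successive pairwise comparisons.

    The pair is shuffled before judging so the judge cannot tell which
    candidate is the current best (assumed meaning of "blind").
    judge_fn returns 0 or 1, the index of the preferred text.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        pair = [best, challenger]
        rng.shuffle(pair)
        best = pair[judge_fn(pair[0], pair[1])]
    return best


# Toy stand-ins for the model calls (assumptions, for illustration only).
def stub_caption(frame: Frame) -> str:
    return f"frame shows {frame}"


def stub_detect(captions: List[str]) -> int:
    return next((i for i, c in enumerate(captions) if "collision" in c), 0)


def stub_reason(frame: Frame, captions: List[str], idx: int) -> str:
    return f"Incident at frame {idx}: {captions[idx]}"


def stub_judge(a: str, b: str) -> int:
    return 0 if len(a) >= len(b) else 1  # toy preference: longer text wins


frames = ["clear road", "car brakes", "collision with barrier", "stopped traffic"]
report = generate_report(frames, stub_caption, stub_detect, stub_reason)
chosen = blind_ab_select(
    ["short", "a much longer report", "mid one"], stub_judge, random.Random(0)
)
```

Passing the stages in as callables keeps the control flow separate from any particular model, so each stub could be swapped for a real VLM call or an ensemble member without changing the pipeline.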
Related papers
- Generative Scenario Rollouts for End-to-End Autonomous Driving [58.99809446189301]
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes.
arXiv Detail & Related papers (2026-01-16T17:59:28Z) - dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z) - Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling [5.609178055761294]
We present FSDAM, a framework that achieves joint attention prediction and caption generation with 100 annotated examples. FSDAM achieves competitive performance on attention prediction and generates coherent, context-aware explanations. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems.
arXiv Detail & Related papers (2025-11-16T17:50:30Z) - From Narratives to Probabilistic Reasoning: Predicting and Interpreting Drivers' Hazardous Actions in Crashes Using Large Language Model [3.3457493284891338]
Two-vehicle crashes account for approximately 70% of roadway crashes. Driver Hazardous Action (DHA) data is limited by inconsistent and labor-intensive manual coding practices. Here, we present an innovative framework that leverages a fine-tuned large language model to automatically infer DHAs from textual crash narratives.
arXiv Detail & Related papers (2025-10-14T21:35:47Z) - Towards Safer and Understandable Driver Intention Prediction [30.136400523083907]
We introduce the task of interpretability in predicting maneuvers before they occur, for driver safety. To foster research in interpretable DIP, we curate DAAD-X, a new multimodal, ego-centric video dataset. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that inherently generates coherent explanations.
arXiv Detail & Related papers (2025-10-10T09:41:25Z) - CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine [73.74077186298523]
CoReVLA is a continual learning framework for autonomous driving. It improves the performance in long-tail scenarios through a dual-stage process of data Collection and behavior Refinement. CoReVLA achieves a Driving Score (DS) of 72.18 and a Success Rate (SR) of 50%, outperforming state-of-the-art methods by 7.96 DS and 15% SR under long-tail, safety-critical scenarios.
arXiv Detail & Related papers (2025-09-19T13:25:56Z) - MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding [7.093473654069259]
We propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. The model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications.
arXiv Detail & Related papers (2025-07-08T15:14:53Z) - ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z) - SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic. Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
arXiv Detail & Related papers (2025-02-28T21:53:47Z) - Cognitive Accident Prediction in Driving Scenes: A Multimodality Benchmark [77.54411007883962]
We propose a Cognitive Accident Prediction (CAP) method that explicitly leverages human-inspired cognition of text description on the visual observation and the driver attention to facilitate model training.
CAP is formulated by an attentive text-to-vision shift fusion module, an attentive scene context transfer module, and the driver attention guided accident prediction module.
We construct a new large-scale benchmark consisting of 11,727 in-the-wild accident videos with over 2.19 million frames.
arXiv Detail & Related papers (2022-12-19T11:43:02Z) - Divide-and-Conquer for Lane-Aware Diverse Trajectory Prediction [71.97877759413272]
Trajectory prediction is a safety-critical tool for autonomous vehicles to plan and execute actions.
Recent methods have achieved strong performances using Multi-Choice Learning objectives like winner-takes-all (WTA) or best-of-many.
Our work addresses two key challenges in trajectory prediction: learning outputs and making better predictions by imposing constraints using driving knowledge.
arXiv Detail & Related papers (2021-04-16T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.