Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis
- URL: http://arxiv.org/abs/2510.11907v1
- Date: Mon, 13 Oct 2025 20:18:23 GMT
- Title: Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis
- Authors: Blessing Agyei Kyem, Neema Jakisa Owor, Andrews Danyo, Joshua Kofi Asamoah, Eugene Denteh, Tanner Muturi, Anthony Dontoh, Yaw Adu-Gyamfi, Armstrong Aboah
- Abstract summary: Traffic safety analysis requires complex video understanding to capture behavioral patterns and generate descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization.
- Score: 7.392659193819963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.
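No implementation accompanies this abstract, so the following is only a minimal sketch of the task-routing idea it describes: captioning requests go to the captioning-specialized checkpoint (VideoLLaMA in the paper) and VQA requests to the VQA-specialized one (Qwen2.5-VL), so the two never share weights or training signals. The class name, the callable interface, and the stub models are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the task-specific routing described in the abstract.
# The two model callables are placeholders: in the paper, captioning is handled
# by a VideoLLaMA checkpoint fine-tuned only on captioning data, and VQA by a
# Qwen2.5-VL checkpoint fine-tuned only on VQA data. Interface and names here
# are illustrative assumptions, not the authors' implementation.
from typing import Callable, Literal

Task = Literal["caption", "vqa"]

class DualModelRouter:
    """Routes each request to the model specialized for that task, so the
    captioning and VQA models never interfere with each other's training."""

    def __init__(self,
                 caption_model: Callable[[str], str],    # e.g. fine-tuned VideoLLaMA
                 vqa_model: Callable[[str, str], str]):  # e.g. fine-tuned Qwen2.5-VL
        self.caption_model = caption_model
        self.vqa_model = vqa_model

    def run(self, task: Task, video_path: str, question: str | None = None) -> str:
        if task == "caption":
            return self.caption_model(video_path)
        if task == "vqa":
            if question is None:
                raise ValueError("VQA requests need a question")
            return self.vqa_model(video_path, question)
        raise ValueError(f"Unknown task: {task}")

# Usage with stub models standing in for the real checkpoints:
if __name__ == "__main__":
    router = DualModelRouter(
        caption_model=lambda v: f"[caption for {v}]",
        vqa_model=lambda v, q: f"[answer to '{q}' for {v}]",
    )
    print(router.run("caption", "wts_clip_0001.mp4"))
    print(router.run("vqa", "wts_clip_0001.mp4", question="Was the pedestrian crossing?"))
```

Keeping two separately fine-tuned checkpoints, rather than one jointly trained model, is the design choice the ablation credits with the reported 8.6% gain in VQA accuracy.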
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
arXiv Detail & Related papers (2026-02-12T08:53:32Z) - Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge [0.0]
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge. The BEHAVIOR Challenge is a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
arXiv Detail & Related papers (2025-12-07T18:08:45Z) - VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations [59.40631942092535]
Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries. Recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL). We propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training.
arXiv Detail & Related papers (2025-10-27T14:55:38Z) - Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts [27.64955941993406]
We present a vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. Notably, the system maintains 96% accuracy under severe visual corruption.
arXiv Detail & Related papers (2025-10-21T18:24:59Z) - MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment [14.705190484805962]
Video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. We introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos. Experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task.
arXiv Detail & Related papers (2025-09-15T05:16:54Z) - VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning [50.34205095371895]
Video quality assessment aims to objectively quantify perceptual quality degradation. Existing VQA models suffer from two critical limitations. We propose VQAThinker, a reasoning-based VQA framework.
arXiv Detail & Related papers (2025-08-08T06:16:23Z) - Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z) - Towards Generalized Video Quality Assessment: A Weak-to-Strong Learning Paradigm [76.63001244080313]
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception. The dominant VQA paradigm relies on supervised training with human-labeled datasets. We explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on large-scale human-labeled datasets.
arXiv Detail & Related papers (2025-05-06T15:29:32Z) - VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA$^2$ Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment. The VQA$^2$ series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z) - First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge [4.075139470537149]
We present our first-place solution to the Multiple-choice Video Question Answering track of The Second Perception Test Challenge.
This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content.
arXiv Detail & Related papers (2024-09-20T14:31:13Z) - Exploring Question Decomposition for Zero-Shot VQA [99.32466439254821]
We investigate a question decomposition strategy for visual question answering.
We show that naive application of model-written decompositions can hurt performance.
We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors (a rough sketch of this idea appears at the end of this list).
arXiv Detail & Related papers (2023-10-25T23:23:57Z) - GAT: Guided Adversarial Training with Pareto-optimal Auxiliary Tasks [73.88590165742721]
We propose a novel adversarial training technique that exploits auxiliary tasks when training data is limited.
Our approach extends single-task models into multi-task models during the min-max optimization of adversarial training.
We demonstrate that guided multi-task learning is an actionable and promising avenue to push further the boundaries of model robustness.
arXiv Detail & Related papers (2023-02-06T16:23:24Z)
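The "Exploring Question Decomposition for Zero-Shot VQA" entry above names a selective decomposition mechanism; the sketch below is a rough, hypothetical rendering of that idea, not the paper's implementation. The confidence threshold, prompt strings, and stub callables are all assumptions.

```python
# Rough illustration of selective question decomposition for zero-shot VQA:
# answer directly when confident, otherwise decompose the question, answer the
# sub-questions, and re-ask conditioned on them. Threshold, prompts, and
# callables are placeholder assumptions, not the paper's code.
from typing import Callable

def answer_with_selective_decomposition(
    question: str,
    image: str,
    vqa_model: Callable[[str, str], tuple[str, float]],  # returns (answer, confidence)
    decomposer: Callable[[str], list[str]],               # writes sub-questions
    confidence_threshold: float = 0.7,
) -> str:
    answer, confidence = vqa_model(question, image)
    if confidence >= confidence_threshold:
        return answer  # direct answer is trusted; no decomposition needed

    # Low confidence: second-guess the prediction via model-written sub-questions.
    sub_questions = decomposer(question)
    sub_answers = [vqa_model(sq, image)[0] for sq in sub_questions]
    context = " ".join(f"Q: {sq} A: {sa}." for sq, sa in zip(sub_questions, sub_answers))
    revised_answer, _ = vqa_model(f"{context} Given this, {question}", image)
    return revised_answer

# Usage with stub models standing in for a real vision-language model:
if __name__ == "__main__":
    stub_vqa = lambda q, img: ("yes", 0.4 if "crossing" in q else 0.9)
    stub_decomposer = lambda q: ["Is a person visible?", "Is a crosswalk visible?"]
    print(answer_with_selective_decomposition(
        "Is the pedestrian crossing the road?", "frame_0001.jpg",
        stub_vqa, stub_decomposer))
```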