iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
- URL: http://arxiv.org/abs/2509.19552v2
- Date: Wed, 01 Oct 2025 06:54:44 GMT
- Title: iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
- Authors: Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich
- Abstract summary: iFinder is a semantic grounding framework that translates dash-cam videos into a hierarchical, interpretable data structure for large language models. iFinder operates as a training-free pipeline that employs pretrained vision models to extract critical cues. It significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks.
- Score: 51.15353027471834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues -- object pose, lane positions, and object trajectories -- which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, this enables step-wise, grounded reasoning in which the LLM refines a peer V-VLM's outputs into accurate answers. Evaluations on four public zero-shot dash-cam driving benchmarks show that iFinder's grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs, with gains of up to 39% in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
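The abstract specifies the cue types (object pose, lane position, trajectory), the frame-/video-level hierarchy, and a three-block prompt, but not a concrete schema. The following is a minimal Python sketch of what such a grounding structure and prompt could look like; all names (ObjectCue, FrameCues, VideoStructure, build_prompt, and the block labels) are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for iFinder-style grounding; field names are
# inferred from the abstract's cue list, not the paper's data structure.

@dataclass
class ObjectCue:
    track_id: int
    category: str          # e.g. "car", "pedestrian"
    pose: str              # e.g. "facing ego-vehicle, oriented left"
    lane_position: str     # e.g. "ego lane", "adjacent left lane"

@dataclass
class FrameCues:
    timestamp_s: float
    objects: List[ObjectCue] = field(default_factory=list)

@dataclass
class VideoStructure:
    frames: List[FrameCues] = field(default_factory=list)
    # Video-level summary, e.g. per-object trajectories across frames.
    trajectories: dict = field(default_factory=dict)

def build_prompt(video: VideoStructure, question: str, vvlm_answer: str) -> str:
    """Assemble a three-block prompt: grounded cues, the peer V-VLM's
    answer to refine, and the task question (block names are illustrative)."""
    cue_lines = []
    for f in video.frames:
        for o in f.objects:
            cue_lines.append(
                f"t={f.timestamp_s:.1f}s: {o.category} #{o.track_id}, "
                f"{o.pose}, in {o.lane_position}"
            )
    return (
        "[BLOCK 1: GROUNDED CUES]\n" + "\n".join(cue_lines) + "\n"
        "[BLOCK 2: PEER V-VLM ANSWER]\n" + vvlm_answer + "\n"
        "[BLOCK 3: TASK]\n" + question +
        "\nReason step by step over the cues, then refine the answer."
    )

video = VideoStructure(frames=[
    FrameCues(0.0, [ObjectCue(1, "car", "facing ego, turning left", "adjacent left lane")])
])
print(build_prompt(video, "Did a lane change cause the near-miss?", "The car braked suddenly."))
```

In this sketch the LLM consumes only the serialized cues rather than raw pixels, which is the decoupling of perception from reasoning that the abstract describes.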
Related papers
- LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation [32.57236582010967]
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data.
We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference.
We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation.
arXiv Detail & Related papers (2025-12-18T18:52:18Z)
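The LinkedOut entry above mentions a cross-layer knowledge fusion MoE that selects an abstraction level from VLLM features. As a rough, hedged sketch only (CrossLayerGate and all shapes are invented for illustration, not LinkedOut's actual architecture), such a gate can be modeled as a learned softmax over per-layer features:

```python
import torch
import torch.nn as nn

# Hypothetical cross-layer fusion gate: each VLLM layer's pooled feature
# acts as one "expert" input, and a learned gate weights the abstraction
# levels per item. Names and shapes are assumptions.

class CrossLayerGate(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(num_layers * dim, num_layers)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (batch, num_layers, dim), one feature per VLLM layer
        b, l, d = layer_feats.shape
        weights = torch.softmax(self.gate(layer_feats.reshape(b, l * d)), dim=-1)
        # Weighted sum over layers -> (batch, dim) fused representation
        return (weights.unsqueeze(-1) * layer_feats).sum(dim=1)

fuse = CrossLayerGate(num_layers=4, dim=8)
print(fuse(torch.randn(2, 4, 8)).shape)  # torch.Size([2, 8])
```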
- TRANSPORTER: Transferring Visual Semantics from VLM Manifolds [56.749972238005604]
This paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos.
TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces.
In turn, logit scores define embedding directions for conditional video generation.
arXiv Detail & Related papers (2025-11-23T09:12:48Z)
- LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models [9.647551134303384]
Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements.
We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning.
arXiv Detail & Related papers (2025-09-29T21:32:54Z)
- Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation [17.238084264485988]
Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description.
PARSE-VOS is a training-free framework powered by Large Language Models (LLMs).
PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.
arXiv Detail & Related papers (2025-09-06T15:46:23Z)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods are prone to be domain-specific, thus being costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- Understanding Long Videos with Multimodal Language Models [44.78900245769057]
Large Language Models (LLMs) have allowed recent approaches to achieve excellent performance on long-video understanding benchmarks.
We investigate how extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance.
Our resulting Multimodal Video Understanding framework demonstrates state-of-the-art performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2024-03-25T17:59:09Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
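The DoraemonGPT entry above mentions converting the input video into a symbolic memory of task-related attributes. As a rough, hypothetical illustration only (MemoryRow, SymbolicMemory, and the schema are invented for this sketch, not DoraemonGPT's actual design), such a memory can be modeled as a queryable table that an LLM planner consults instead of re-processing the video:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for a "symbolic memory": a flat table of
# task-related attributes extracted per clip (schema is an assumption).

@dataclass
class MemoryRow:
    start_s: float
    end_s: float
    attribute: str   # e.g. "action", "object", "scene"
    value: str

class SymbolicMemory:
    def __init__(self, rows: List[MemoryRow]):
        self.rows = rows

    def query(self, attribute: str) -> List[MemoryRow]:
        """Return all rows matching an attribute, so an LLM planner can
        look up facts instead of re-watching the video."""
        return [r for r in self.rows if r.attribute == attribute]

mem = SymbolicMemory([
    MemoryRow(0.0, 2.5, "action", "car merges into ego lane"),
    MemoryRow(2.5, 4.0, "object", "pedestrian at crosswalk"),
])
print([r.value for r in mem.query("action")])  # ['car merges into ego lane']
```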
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and the modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)