Online Reasoning Video Segmentation with Just-in-Time Digital Twins
- URL: http://arxiv.org/abs/2503.21056v1
- Date: Thu, 27 Mar 2025 00:06:40 GMT
- Title: Online Reasoning Video Segmentation with Just-in-Time Digital Twins
- Authors: Yiqing Shen, Bohan Liu, Chenjia Li, Lalithkumar Seenivasan, Mathias Unberath,
- Abstract summary: Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries.<n>Current RS approaches rely heavily on the visual perception capabilities of multimodal large language models.<n>We propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning.
- Score: 8.568569213914378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- a LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different reasoning chain complexity.
Related papers
- QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models.<n>QuoTA strategically allocates frame-level importance scores based on query relevance.<n>We decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling.
We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding.
Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z) - Do Language Models Understand Time? [2.290956583394892]
Large language models (LLMs) have revolutionized video-based computer vision applications, including action recognition, anomaly detection, and summarization.<n>This work critically examines the role of LLMs in video processing, with a specific focus on their temporal reasoning capabilities.<n>We analyze challenges posed by existing video datasets, including biases, lack of temporal annotations, and domain-specific limitations that constrain the temporal understanding of LLMs.
arXiv Detail & Related papers (2024-12-18T13:38:06Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question.
This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions.
We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA)
arXiv Detail & Related papers (2024-11-15T03:45:09Z) - VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs [27.473258727617477]
Long video understanding presents unique challenges due to the complexity of reasoning over extended timespans.
We propose a framework VideoINSTA, i.e. INformative Spatial-TemporAl Reasoning for long-form video understanding.
Our model significantly improves the state-of-the-art on three long video question-answering benchmarks.
arXiv Detail & Related papers (2024-09-30T15:04:14Z) - ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We present ViLLa: Video reasoning segmentation with Large Language Model.
Our ViLLa manages to tackle these challenges through multiple core innovations.
To enable efficient processing of long videos, ViLLa incorporates (3) a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments for less redundancy.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR)
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.<n>Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.<n>We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.