Harnessing Object Grounding for Time-Sensitive Video Understanding
- URL: http://arxiv.org/abs/2509.06335v1
- Date: Mon, 08 Sep 2025 04:52:00 GMT
- Title: Harnessing Object Grounding for Time-Sensitive Video Understanding
- Authors: Tz-Ying Wu, Sharath Nittur Sridhar, Subarna Tripathi
- Abstract summary: We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). GO-Tokenizer is a lightweight add-on module for Video-LLMs that leverages off-the-shelf object detectors to encode compact object information on the fly.
- Score: 13.599316633905355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also adds token length and susceptibility to noise in the object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs that leverages off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart that uses textual descriptions of objects in the prompt. The gain generalizes across models, datasets, and video understanding tasks such as reasoning temporal localization and dense captioning.
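The abstract describes GO-Tokenizer only at a high level, so a minimal sketch of how such an add-on module might look is given below. It assumes an off-the-shelf detector that returns per-frame class labels and normalized boxes, and a Video-LLM whose visual token sequence can be extended; the class/box embedding layout, the fusion layer, and the module name are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a GO-Tokenizer-style add-on (not the paper's code).
# Assumption: an off-the-shelf detector supplies per-frame class ids and boxes,
# and the Video-LLM accepts extra tokens alongside its per-frame visual tokens.
import torch
import torch.nn as nn

class GOTokenizerSketch(nn.Module):
    def __init__(self, num_classes: int, llm_dim: int):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, llm_dim)  # object category -> embedding
        self.box_proj = nn.Linear(4, llm_dim)                   # (x1, y1, x2, y2) -> embedding
        self.fuse = nn.Linear(2 * llm_dim, llm_dim)              # one compact token per object

    def forward(self, class_ids: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # class_ids: (T, K) detected labels; boxes: (T, K, 4) normalized coordinates.
        # Returns (T, K, llm_dim) object tokens to append to each frame's visual tokens.
        cls_emb = self.class_embed(class_ids)
        box_emb = self.box_proj(boxes)
        return self.fuse(torch.cat([cls_emb, box_emb], dim=-1))
```

Compared with spelling the detections out as prompt text, such per-object tokens keep the sequence short and shield the LLM from raw, possibly noisy label strings, which is the trade-off the abstract points to.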
Related papers
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z) - Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting [60.58915701973593]
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni, and a Captioner using InternVL-2.5. Our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data.
arXiv Detail & Related papers (2025-04-07T22:35:36Z) - Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking [15.551049337773962]
Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes. Tracks can be updated online as needed by generating new structured descriptions and detections.
arXiv Detail & Related papers (2025-03-18T20:18:42Z) - VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z) - VideoOrion: Tokenizing Object Dynamics in Videos [34.96534298857146]
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens.
arXiv Detail & Related papers (2024-11-25T07:32:02Z) - One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos [41.34787907803329]
VideoLISA is a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.
VideoLISA generates temporally consistent segmentation masks in videos based on language instructions.
arXiv Detail & Related papers (2024-09-29T07:47:15Z) - Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks.
We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (see the frame-sampling sketch after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events (see the matching sketch after this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z) - OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement [44.228748086927375]
We introduce OVC-Net, an object-oriented video captioning network built on a temporal graph and detail enhancement.
To demonstrate its effectiveness, we conduct experiments on the new dataset and compare against state-of-the-art video captioning methods.
arXiv Detail & Related papers (2020-03-08T04:34:58Z)
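For the VaQuitA entry above, the referenced frame-sampling sketch follows. It illustrates CLIP-score-guided frame selection under stated assumptions: the Hugging Face CLIP checkpoint, the top-k strategy, and the function name are illustrative choices, and the paper's exact scoring procedure may differ.

```python
# Hedged sketch of CLIP-score-guided frame sampling (illustrative, not VaQuitA's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def sample_frames_by_clip_score(frames: list[Image.Image], query: str, k: int = 8) -> list[int]:
    """Rank decoded frames by CLIP image-text similarity to the query and keep the
    top-k, returned in temporal order, instead of sampling frames uniformly."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,) similarity
    top = torch.topk(scores, k=min(k, len(frames))).indices
    return sorted(top.tolist())  # temporal order preserved for the Video-LLM
```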
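For the VidCoM entry, the referenced matching sketch follows. It shows the kind of Hungarian matching InsOVER is described as using to align decomposed sub-instructions with candidate video events; the cosine-distance cost over embeddings from some upstream encoder is an assumption, not the paper's stated formulation.

```python
# Hedged sketch of Hungarian matching between sub-instructions and video events.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_emb: np.ndarray, event_emb: np.ndarray):
    """instr_emb: (M, D) embeddings of decomposed sub-instructions;
    event_emb: (N, D) embeddings of candidate video events.
    Returns (instruction_index, event_index) pairs minimizing total cosine distance."""
    a = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
    b = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine distance as assignment cost
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))
```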
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.