Object-Centric Framework for Video Moment Retrieval
- URL: http://arxiv.org/abs/2512.18448v1
- Date: Sat, 20 Dec 2025 17:44:53 GMT
- Title: Object-Centric Framework for Video Moment Retrieval
- Authors: Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli
- Abstract summary: Most existing moment retrieval methods rely on temporal sequences of frame-level features that primarily encode global visual and semantic information. In particular, temporal dynamics at the object level have been largely overlooked, limiting existing approaches in scenarios requiring object-level reasoning. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time.
- Score: 15.916994168542345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.
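The abstract outlines a pipeline: a scene graph parser identifies query-relevant objects, per-frame scene graphs yield object-level feature sequences, and a relational tracklet transformer scores spatio-temporal correlations among objects. The snippet below is a minimal sketch of how such a tracklet transformer stage might look; the module name, dimensions, query-token design, and pooling into moment scores are all illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of an object-centric moment-retrieval stage inspired by
# the abstract above. All design choices here are assumptions for illustration.
import torch
import torch.nn as nn


class RelationalTrackletTransformer(nn.Module):
    """Stand-in for the 'relational tracklet transformer': attends jointly over
    object tracklets and time to relate them to a text query."""

    def __init__(self, feat_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, tracklet_feats, query_feat):
        # tracklet_feats: (B, N_objects * T, D) -- object-level feature
        #   sequences built from per-frame scene graphs, flattened over
        #   objects and frames so attention can mix both axes.
        # query_feat: (B, D) -- embedding of the text query after extracting
        #   query-relevant objects with a scene graph parser.
        q = self.query_proj(query_feat).unsqueeze(1)        # (B, 1, D)
        tokens = torch.cat([q, tracklet_feats], dim=1)      # prepend query token
        encoded = self.encoder(tokens)
        # Per-token relevance scores; frame-level scores could be pooled from
        # the object tokens of each frame to localize the described moment.
        return self.score_head(encoded[:, 1:]).squeeze(-1)  # (B, N_objects * T)


if __name__ == "__main__":
    model = RelationalTrackletTransformer()
    tracklets = torch.randn(2, 5 * 32, 256)  # 5 objects x 32 frames per video
    query = torch.randn(2, 256)
    print(model(tracklets, query).shape)      # torch.Size([2, 160])
```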
Related papers
- Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting [60.58915701973593]
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning. CAT-V integrates three key components: a Segmenter based on SAMI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-UniVL, and a Captioner using Intern-2.5. Our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data.
arXiv Detail & Related papers (2025-04-07T22:35:36Z) - The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning [49.40254251698784]
We propose a dynamic learning approach for moment retrieval, with two strategies designed to mitigate spurious correlations. First, we introduce a novel video synthesis approach to construct a dynamic context for the queried moment. Second, to alleviate over-association with backgrounds, we enhance representations temporally by incorporating text-dynamics interaction.
arXiv Detail & Related papers (2025-01-13T13:13:06Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking the context into account. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - A Modern Take on Visual Relationship Reasoning for Grasp Planning [10.543168383800532]
We present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin-picking scenes with up to 35 objects from 97 distinct categories. We also propose D3G, a new end-to-end transformer-based dependency graph generation model.
arXiv Detail & Related papers (2024-09-03T16:30:48Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Temporal Collection and Distribution for Referring Video Object Segmentation [14.886278504056063]
Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression.
We propose to simultaneously maintain a global referent token and a sequence of object queries.
We show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.
arXiv Detail & Related papers (2023-09-07T04:22:02Z) - Identity-Consistent Aggregation for Video Object Detection [21.295859014601334]
In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame.
We propose ClipVID, a VID model equipped with Identity-Consistent Aggregation layers specifically designed for mining fine-grained and identity-consistent temporal contexts.
Experiments demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs.
arXiv Detail & Related papers (2023-08-15T12:30:22Z) - DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z) - Evaluating Temporal Queries Over Video Feeds [25.04363138106074]
Temporal queries involving objects and their co-occurrences in video feeds are of interest to many applications ranging from law enforcement to security and safety.
We present an architecture consisting of three layers, namely object detection/tracking, intermediate data generation and query evaluation.
We propose two techniques, MFS and SSG, to organize all detected objects in the intermediate data generation layer.
We also introduce an algorithm called State Traversal (ST) that processes incoming frames against the SSG and efficiently prunes objects and frames unrelated to query evaluation.
arXiv Detail & Related papers (2020-03-02T14:55:57Z)
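The MFS and SSG structures and the State Traversal algorithm are not detailed in this summary, so the snippet below is only a generic illustration of the kind of temporal co-occurrence query such a system evaluates over per-frame detections, not the proposed method.

```python
# Generic sketch (assumed, not the paper's MFS/SSG/ST method): find frame
# spans where a set of object labels co-occur for at least `min_len`
# consecutive frames of a video feed.
from typing import Iterable, List, Set, Tuple


def co_occurrence_spans(detections: List[Set[str]],
                        labels: Iterable[str],
                        min_len: int) -> List[Tuple[int, int]]:
    """detections[i] is the set of labels detected in frame i.
    Returns (start, end) frame-index pairs where all `labels` co-occur
    for at least `min_len` consecutive frames."""
    labels = list(labels)
    spans, start = [], None
    for i, labs in enumerate(detections):
        if all(l in labs for l in labels):
            if start is None:
                start = i                      # open a new run
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i - 1))   # close a qualifying run
            start = None
    if start is not None and len(detections) - start >= min_len:
        spans.append((start, len(detections) - 1))
    return spans


if __name__ == "__main__":
    frames = [{"person"}, {"person", "car"}, {"person", "car"},
              {"car"}, {"person", "car"}]
    print(co_occurrence_spans(frames, ("person", "car"), min_len=2))  # [(1, 2)]
```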
This list is automatically generated from the titles and abstracts of the papers in this site.