VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
- URL: http://arxiv.org/abs/2601.07290v1
- Date: Mon, 12 Jan 2026 07:51:37 GMT
- Title: VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
- Authors: Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu
- Abstract summary: VideoLoom is a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. We present LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. We also introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs.
- Score: 46.97966072048103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
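For context on the reported numbers: R1@0.7 on Charades-STA is conventionally Recall@1 at a temporal IoU (tIoU) threshold of 0.7, i.e., the fraction of language queries whose single top-ranked segment overlaps the annotated moment with tIoU ≥ 0.7, while J&F on ReVOS averages region similarity (mask Jaccard) with contour accuracy. The sketch below shows the tIoU-based recall computation only; it is illustrative, and the function and variable names are hypothetical, not taken from the VideoLoom paper or its code.

```python
# Minimal sketch of Recall@1 at a temporal IoU threshold (e.g., R1@0.7),
# the temporal-grounding metric cited in the abstract. Names are
# illustrative assumptions, not the paper's actual evaluation code.

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.7):
    """Fraction of queries whose top-1 predicted segment reaches the tIoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Example: two queries, one top-1 predicted segment each.
preds = [(2.0, 7.5), (10.0, 14.0)]   # model's top-1 (start, end) per query
gts   = [(2.2, 7.8), (20.0, 25.0)]   # annotated ground-truth moments
print(recall_at_1(preds, gts))        # 0.5 -> would be reported as 50.0 R1@0.7
```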
Related papers
- Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data [100.5266292850922]
Strefer is a synthetic data generation framework designed to equip Video Large Language Models with referring and reasoning capabilities. Strefer produces diverse instruction data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions.
arXiv Detail & Related papers (2025-09-03T17:33:20Z) - DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding [19.50051728766238]
We propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions.
arXiv Detail & Related papers (2025-07-21T12:50:49Z) - Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z) - SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability [58.46310813774538]
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. We introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability.
arXiv Detail & Related papers (2025-03-18T07:40:36Z) - Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs [66.57518905079262]
VideoMind organizes critical video moments into a topologically structured semantic graph. The "Mind Palace" organizes key information through (i) hand-object tracking, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping.
arXiv Detail & Related papers (2025-01-08T08:15:29Z) - VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLMs for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)