1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning
- URL: http://arxiv.org/abs/2512.06673v1
- Date: Sun, 07 Dec 2025 06:11:15 GMT
- Title: 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning
- Authors: Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Teng Long, Yihua Shao, Haozhe Wang, Zhaowen Lin, Wei Wang, Nicu Sebe
- Abstract summary: We present a Detector-Empowered Video LLM, DEViL for short. DEViL couples a Video LLM with an open-vocabulary detector (OVD) via a reference-semantic token (RST). Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding.
- Score: 53.28271278708241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and localization results to drift progressively across the video. To address this, we present a Detector-Empowered Video LLM, DEViL for short, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released at https://github.com/gaostar123/DeViL.
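The RST coupling and TTReg described above can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration under assumed names and dimensions, not the authors' released code: `RSTBridge`, `llm_dim`, `ovd_text_dim`, and `tube_consistency_loss` are all hypothetical. It shows the hidden state at the RST position being projected into the detector's text-embedding space, plus a simple cosine-consistency loss standing in for the role TTReg plays in keeping per-frame object queries temporally aligned.

```python
# Minimal sketch (assumed names/shapes, not the authors' code) of coupling
# a Video LLM to an open-vocabulary detector through the RST.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSTBridge(nn.Module):
    """Projects the RST hidden state into the OVD's text-embedding space."""
    def __init__(self, llm_dim: int = 4096, ovd_text_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim // 2),
            nn.GELU(),
            nn.Linear(llm_dim // 2, ovd_text_dim),
        )

    def forward(self, rst_hidden: torch.Tensor) -> torch.Tensor:
        # rst_hidden: (B, llm_dim), LLM hidden state at the RST position.
        # The output replaces the embedding the OVD would normally obtain
        # from its text encoder, so detector losses reach the LLM.
        return F.normalize(self.proj(rst_hidden), dim=-1)

def tube_consistency_loss(frame_queries: torch.Tensor) -> torch.Tensor:
    # frame_queries: (B, T, D), the detector query matched to the target in
    # each of T frames. A simple stand-in for TTReg: pull per-frame queries
    # toward their tube-level mean so the target stays temporally consistent.
    mean_q = frame_queries.mean(dim=1, keepdim=True)               # (B, 1, D)
    return (1.0 - F.cosine_similarity(frame_queries, mean_q, dim=-1)).mean()

# Usage sketch: the projected RST is fed to the detector as its "text" input.
bridge = RSTBridge()
text_embed = bridge(torch.randn(2, 4096))             # (2, 512) -> OVD head
reg = tube_consistency_loss(torch.randn(2, 16, 256))  # scalar regularizer
```

Because the detector consumes the projected RST exactly where its text embedding would otherwise go, the detector's box and classification losses can backpropagate through the projection into the LLM, which, in this reading, is what makes referential understanding and spatial localization trainable end to end.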
Related papers
- Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding [47.400649582392255]
We use multimodal large language models (MLLMs) to explore a zero-shot solution to STVG. We propose an MLLM-based zero-shot framework for STVG that includes novel temporal-augmented assembling strategies.
arXiv Detail & Related papers (2025-09-18T17:35:50Z) - Temporal Grounding as a Learning Signal for Referring Video Object Segmentation [29.646697516547558]
Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. Existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training. We introduce MeViS-M, a dataset built upon the challenging MeViS benchmark, where we manually annotate the temporal spans in which each object is referred to by the expression.
arXiv Detail & Related papers (2025-08-16T07:34:43Z) - Aligning Effective Tokens with Video Anomaly in Large Language Models [42.99603812716817]
We propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules. We construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs.
arXiv Detail & Related papers (2025-08-08T14:30:05Z) - Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification [47.40091830500585]
Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. A video-level language-driven VVI-ReID framework (VLD) consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal aggregation.
arXiv Detail & Related papers (2025-06-03T04:49:08Z) - VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding [48.745013691038295]
VideoExpert is a general-purpose MLLM suitable for several temporal-sensitive video tasks. The Temporal Expert is responsible for modeling time sequences and performing temporal grounding. The Spatial Expert focuses on content detail analysis and instruction following. By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions.
arXiv Detail & Related papers (2025-04-10T07:33:39Z) - SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability [58.46310813774538]
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. We introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability.
arXiv Detail & Related papers (2025-03-18T07:40:36Z) - Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues. To facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes.
arXiv Detail & Related papers (2024-12-04T00:50:33Z) - Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator that speeds up Large Language Model (LLM) inference with references.
It is motivated by the observation that there are abundant identical text spans between an LLM's decoding result and a reference that is available in many real-world scenarios (a toy sketch of this copy-then-verify loop appears after this list).
arXiv Detail & Related papers (2023-04-10T09:55:14Z) - Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to a goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z)
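Of the related papers above, the LLMA entry (Inference with Reference) describes the most self-contained mechanism, so here is the toy copy-then-verify sketch referenced from that item. This is a plain-Python reading of the abstract, not the released implementation: `greedy_next` is a hypothetical stand-in for one greedy LM step. When the current output suffix reappears in the reference, the following reference tokens are copied as a draft and checked against the model's own predictions, so the output remains identical to ordinary greedy decoding (hence "lossless").

```python
# Toy sketch of LLMA-style "inference with reference" (illustration only).
from typing import Callable, List

def llma_decode(prompt: List[int], reference: List[int],
                greedy_next: Callable[[List[int]], int],
                match_len: int = 4, copy_len: int = 8,
                max_new: int = 64, eos: int = -1) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new and (not out or out[-1] != eos):
        # 1) Look for the current output suffix inside the reference.
        suffix, span = out[-match_len:], []
        for i in range(len(reference) - match_len + 1):
            if reference[i:i + match_len] == suffix:
                span = reference[i + match_len:i + match_len + copy_len]
                break
        if not span:
            out.append(greedy_next(out))  # no match: plain greedy step
            continue
        # 2) Verify the copied span token by token, always keeping the
        # model's own prediction, so the result equals greedy decoding.
        for tok in span:
            nxt = greedy_next(out)
            out.append(nxt)
            if nxt != tok or nxt == eos or len(out) - len(prompt) >= max_new:
                break
    return out

# Demo with a trivial "LM" that always continues the reference verbatim:
ref = list(range(100))
demo_next = lambda p: ref[len(p)] if len(p) < len(ref) else -1
print(llma_decode(prompt=ref[:4], reference=ref, greedy_next=demo_next,
                  max_new=12))  # -> [0, 1, ..., 15]
```

Note the toy calls the stub once per token, so it demonstrates correctness rather than speed; the real system verifies a whole copied span in a single batched forward pass, which is where the acceleration comes from.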