Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
- URL: http://arxiv.org/abs/2509.15178v1
- Date: Thu, 18 Sep 2025 17:35:50 GMT
- Title: Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
- Authors: Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau
- Abstract summary: We use multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies.
- Score: 47.400649582392255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as \textit{grounding tokens}, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to the inability to fully integrate the cues in the text query (\textit{e.g.}, attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries that probe the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatially and temporally related visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy, which assembles the predictions made from the original video frames and from temporal-augmented frames to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
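The abstract describes DSTH and TAS only at a high level. Below is a minimal, hedged sketch of how the two ideas could look in code. The additive latent prompt, the frame-reversal augmentation, the mean fusion, and all names (`mllm_logits_fn`, `ground_fn`) are illustrative assumptions, not the authors' implementation, which is to be released at the repository above.

```python
import torch
import torch.nn.functional as F

def learn_latent_prompt(mllm_logits_fn, frames, yes_token_id, steps=20, lr=0.1):
    """LRA-style idea (sketch): optimize a latent visual prompt so the frozen
    MLLM's next-token logit for answering "yes" to a sub-query (an attribute
    or action question from DSTH) increases, highlighting the relevant cue."""
    prompt = torch.zeros_like(frames, requires_grad=True)  # assumed additive prompt
    opt = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        logits = mllm_logits_fn(frames + prompt)           # (vocab_size,) logits
        loss = -F.log_softmax(logits, dim=-1)[yes_token_id]
        opt.zero_grad()
        loss.backward()
        opt.step()
    return prompt.detach()

def assemble_temporal_views(ground_fn, frames):
    """TAS-style idea (sketch): ground on the original clip and a temporally
    augmented clip (here simply reversed), undo the augmentation, and fuse
    the per-frame scores to encourage temporal consistency."""
    s_orig = ground_fn(frames)                             # (T,) per-frame scores
    s_aug = torch.flip(ground_fn(torch.flip(frames, dims=[0])), dims=[0])
    return (s_orig + s_aug) / 2

# Toy usage with stand-in functions (no actual MLLM involved).
frames = torch.randn(16, 3, 64, 64)                        # T=16 small RGB frames
toy_logits = lambda f: torch.stack([f.mean(), -f.mean()])  # fake 2-token vocab
toy_ground = lambda f: f.mean(dim=(1, 2, 3))               # fake per-frame scores
prompt = learn_latent_prompt(toy_logits, frames, yes_token_id=0)
scores = assemble_temporal_views(toy_ground, frames + prompt)
print(prompt.shape, scores.shape)
```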
Related papers
- 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning [53.28271278708241]
We present DEViL, short for Detector-Empowered Video LLM. DEViL couples a video LLM with an open-vocabulary detector (OVD). Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding.
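The summary describes a special token (the RST) whose hidden state replaces the detector's text embedding. A minimal sketch of that general idea follows; the module name, dimensions, and linear projection are hypothetical assumptions, not DEViL's actual design.

```python
import torch
import torch.nn as nn

class TokenToDetectorQuery(nn.Module):
    """Sketch: project the hidden state of a special LLM token so an
    open-vocabulary detector can consume it in place of a text embedding."""
    def __init__(self, llm_dim=4096, det_dim=512):
        super().__init__()
        self.proj = nn.Linear(llm_dim, det_dim)  # LLM space -> detector space

    def forward(self, rst_hidden_state):
        # rst_hidden_state: (B, llm_dim) hidden state of the special token
        return self.proj(rst_hidden_state)

query = TokenToDetectorQuery()(torch.randn(2, 4096))
print(query.shape)  # torch.Size([2, 512])
```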
arXiv Detail & Related papers (2025-12-07T06:11:15Z) - Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning [41.30900315121155]
Multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained word-level alignment in standard visual encoders. We propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without architectural modifications.
arXiv Detail & Related papers (2025-11-26T13:21:15Z) - Spatial Preference Rewarding for MLLMs Spatial Understanding [92.25703021388142]
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
arXiv Detail & Related papers (2025-10-16T07:16:18Z) - A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advances in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z) - Spatio-Temporal LLM: Reasoning about Environments and Actions [6.341762228330488]
"S-temporal" prompts challenge current Multimodal Large Language Models (MLLMs)<n>We show that recent MLLMs indeed struggle to correctly answer "s-temporal" prompts.<n>We build on this dataset to develop two-temporal LLM baselines.
arXiv Detail & Related papers (2025-07-07T17:59:55Z) - SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability [58.46310813774538]
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. We introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability.
arXiv Detail & Related papers (2025-03-18T07:40:36Z) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - How Can Large Language Models Understand Spatial-Temporal Data? [12.968952073740796]
This paper introduces STG-LLM, an innovative approach empowering Large Language Models for spatial-temporal forecasting.
We tackle the data mismatch by proposing: 1) STG-Tokenizer: This spatial-temporal graph tokenizer transforms intricate graph data into concise tokens capturing both spatial and temporal relationships; 2) STG-Adapter: This minimalistic adapter, consisting of linear encoding and decoding layers, bridges the gap between tokenized data and LLM comprehension (a minimal code sketch of the adapter idea follows this entry).
arXiv Detail & Related papers (2024-01-25T14:03:15Z)
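As noted in the STG-LLM entry above, the STG-Adapter is described as linear encoding and decoding layers bridging tokenized graph data and the LLM. A minimal sketch under that description, with hypothetical dimensions and an identity stand-in for the frozen LLM:

```python
import torch
import torch.nn as nn

class STGAdapter(nn.Module):
    """Sketch: linear layers that encode spatio-temporal graph tokens into
    the LLM's embedding space and decode its outputs back to forecasts."""
    def __init__(self, token_dim=64, llm_dim=4096, horizon=12):
        super().__init__()
        self.encode = nn.Linear(token_dim, llm_dim)  # graph tokens -> LLM space
        self.decode = nn.Linear(llm_dim, horizon)    # LLM hidden -> forecast

    def forward(self, stg_tokens, llm):
        # stg_tokens: (B, N, token_dim) tokens from an STG-Tokenizer-like step
        hidden = llm(self.encode(stg_tokens))        # frozen LLM backbone
        return self.decode(hidden)                   # (B, N, horizon) forecasts

# Toy usage with an identity stand-in for the frozen LLM.
out = STGAdapter()(torch.randn(4, 10, 64), llm=nn.Identity())
print(out.shape)  # torch.Size([4, 10, 12])
```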