TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding
- URL: http://arxiv.org/abs/2508.07683v1
- Date: Mon, 11 Aug 2025 06:59:32 GMT
- Authors: Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, Chengjiang Long,
- Abstract summary: Temporal Video Grounding aims to precisely localize video segments corresponding to natural language queries. We propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), which introduces timestamp anchors within the reasoning process to enforce explicit supervision of the thought content.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
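The abstract's core constraint, that each reasoning step must produce an increasingly accurate temporal estimation, can be sketched as a reward that checks anchor-by-anchor progress toward the ground-truth segment. The sketch below is a hypothetical illustration of that idea; the function names, the monotonicity check, and the exact reward weights are assumptions, not the paper's actual GRPO reward.

```python
# Hypothetical sketch of the timestamp-anchor constraint in TAR-TVG:
# a reasoning trace emits a sequence of (start, end) anchors, and the
# reward credits the final localization plus a bonus only when every
# intermediate anchor is at least as accurate as the one before it.
# Names and weights here are illustrative assumptions.

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def anchor_progression_reward(anchors, gt, final_weight=1.0, step_weight=0.1):
    """Score a trace's anchors: final-anchor IoU, plus a bonus if the
    per-anchor IoUs are non-decreasing (i.e., the reasoning verifiably
    refines its estimate at every step)."""
    if not anchors:
        return 0.0
    ious = [temporal_iou(a, gt) for a in anchors]
    monotone = all(later >= earlier for earlier, later in zip(ious, ious[1:]))
    reward = final_weight * ious[-1]
    if monotone:
        reward += step_weight * len(anchors)  # credit verifiable refinement
    return reward
```

Under this sketch, a trace whose anchors tighten around the ground truth (e.g., (0, 20) → (2, 14) → (4, 12) against a true segment of (5, 12)) scores higher than one that reaches the same anchors in a non-improving order, which is the intended supervisory pressure on the reasoning chain.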
Related papers
- Agentic Spatio-Temporal Grounding via Collaborative Reasoning [80.83158605034465]
Spatio-Temporal Video Grounding aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. We propose the Agentic Spatio-Temporal Grounder (ASTG) framework for the STVG task in an open-world, training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), are constructed by leveraging modern Multimodal Large Language Models (MLLMs). Experiments on popular benchmarks demonstrate the superiority of the proposed approach, which outperforms existing weakly-supervised and zero-shot approaches by a margin.
arXiv Detail & Related papers (2026-02-10T10:16:27Z) - Process-of-Thought Reasoning for Videos [33.74677144833003]
Process-of-Thought (PoT) Reasoning for Videos is a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence.
arXiv Detail & Related papers (2026-02-07T20:25:46Z) - Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z) - Reinforcing Video Reasoning Segmentation to Think Before It Segments [67.5703457389657]
We introduce Veason-R1, a specialized LVLM for video reasoning segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought trajectories. We incorporate a holistic reward mechanism that enhances spatial alignment and temporal consistency. Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins.
arXiv Detail & Related papers (2025-08-15T15:34:56Z) - Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs [12.295608604703117]
Time-R1 is a two-stage reinforcement fine-tuning framework designed to enhance the multi-step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine-tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model's generalization ability. Experiments demonstrate that Time-R1 significantly improves forecast performance across diverse datasets.
arXiv Detail & Related papers (2025-06-12T12:15:50Z) - Enhancing LLM Reasoning for Time Series Classification by Tailored Thinking and Fused Decision [8.256998757769322]
ReasonTSC is a framework designed to leverage LLM reasoning for time series classification. It steers the model to reason over the essential characteristics of time series data. It integrates predictions and confidence scores from plug-in classifiers, e.g., domain-specific time series models, as in-context examples.
arXiv Detail & Related papers (2025-06-01T03:15:54Z) - Generative Regression Based Watch Time Prediction for Short-Video Recommendation [36.95095097454143]
Watch time prediction (WTP) has emerged as a pivotal task in short video recommendation systems. Recent studies have attempted to address these issues by converting continuous watch time estimation into an ordinal regression task. We propose a novel Generative Regression (GR) framework that reformulates WTP as a sequence generation task.
arXiv Detail & Related papers (2024-12-28T16:48:55Z) - TimeRefine: Temporal Grounding with Time Refining Video LLM [75.99665302872901]
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. We reformulate the temporal grounding task as a temporal refining task. We incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth.
arXiv Detail & Related papers (2024-12-12T18:59:11Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks. Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding [90.21119832796136]
Temporal Video Grounding aims to localize a moment from an untrimmed video given the language description.
To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG.
arXiv Detail & Related papers (2024-06-11T09:31:37Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.