Process-of-Thought Reasoning for Videos
- URL: http://arxiv.org/abs/2602.07689v1
- Date: Sat, 07 Feb 2026 20:25:46 GMT
- Title: Process-of-Thought Reasoning for Videos
- Authors: Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang
- Abstract summary: Process-of-Thought (PoT) Reasoning for Videos is a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence.
- Score: 33.74677144833003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
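The abstract describes PoT only at the level of its three interleaved stages, so the following minimal Python sketch is a reading aid rather than the authors' code: the `backbone` interface, the `PoTStep`/`PoTTrace` containers, and the confidence-based stopping rule are all hypothetical assumptions.

```python
# Hypothetical sketch of the PoT loop from the abstract; none of these
# names come from the paper itself.
from dataclasses import dataclass, field

@dataclass
class PoTStep:
    segment: tuple        # (start_sec, end_sec) this step is grounded in
    observation: str      # evidence extracted from that segment
    hypothesis: str       # refined hypothesis after this step

@dataclass
class PoTTrace:
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""

def pot_reason(video, question, backbone, max_steps=5):
    """Interleave (i) evidence selection, (ii) state updates, (iii) synthesis."""
    trace = PoTTrace(question=question)
    state = backbone.init_state(question)
    for _ in range(max_steps):
        # (i) temporal evidence selection: choose the next segment to inspect
        segment = backbone.select_segment(video, state)
        observation = backbone.describe(video, segment)
        # (ii) step-wise state update: refine the running hypothesis
        state, hypothesis = backbone.update_state(state, observation)
        trace.steps.append(PoTStep(segment, observation, hypothesis))
        if backbone.is_confident(state):
            break
    # (iii) constrained answer synthesis: the answer may cite only segments
    # recorded in the trace, keeping it traceable to video evidence
    trace.answer = backbone.synthesize(state, [s.segment for s in trace.steps])
    return trace
```

Keeping the per-step segment alongside each intermediate decision is what the abstract's "unified representation for PoT traces" appears to refer to: every hypothesis revision stays aligned with the temporal evidence that produced it.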
Related papers
- ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning [44.49803237328707]
ReVSeg executes reasoning as sequential decisions in the native interface of pretrained vision language models. We employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals.
arXiv Detail & Related papers (2025-12-02T14:44:12Z)
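As a reading aid, here is a hedged sketch of what outcome-driven RL over a multi-step decision chain can look like; the `policy`/`env` interface and the end-of-chain reward are stand-ins, not ReVSeg's actual setup.

```python
# REINFORCE-style update over a chain of decisions with a single outcome
# reward at the end; the environment and reward are illustrative assumptions.
import torch

def reinforce_chain(policy, env, optimizer, steps=4):
    """One update: sample a chain of decisions, reward only the outcome."""
    log_probs = []
    obs = env.reset()
    for _ in range(steps):
        logits = policy(obs)                       # scores over next decisions
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs = env.step(action.item())              # refine the segmentation
    reward = env.outcome_reward()                  # e.g., final mask IoU
    loss = -reward * torch.stack(log_probs).sum()  # outcome-driven signal only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```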
- Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding [56.7383554589569]
Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. We propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework. We show that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
arXiv Detail & Related papers (2025-11-30T09:27:59Z)
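A minimal sketch of the speculative draft-then-verify pattern the summary describes, under assumed interfaces (`draft_model`, `verify_model`, a fixed sampling stride); SpecTemp's actual mechanism is RL-trained and may differ substantially.

```python
# Hypothetical draft-then-verify loop for long videos; all interfaces are
# illustrative assumptions, not SpecTemp's published API.
def speculative_answer(video_frames, question, draft_model, verify_model,
                       draft_stride=32):
    # Draft cheaply over sparsely sampled frames.
    sparse = video_frames[::draft_stride]
    draft = draft_model.answer(sparse, question)
    # Verify only the frames the draft claims to rely on.
    support = draft_model.cited_frames(draft)   # indices into the full video
    if verify_model.accepts(video_frames, support, question, draft):
        return draft                            # fast path: draft accepted
    # Fall back to dense reasoning only when verification rejects the draft.
    return verify_model.answer(video_frames, question)
```

The acceleration claimed in the summary would correspond to the fast path here: questions that resolve on sparse frames never trigger the expensive dense pass.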
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
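The summary names TAC and VAS without defining them, so the sketch below uses generic stand-ins: an RL reward combining answer accuracy, a temporal-IoU term for precision, and a trace-answer consistency term. The terms and weights are illustrative assumptions, not Video-R2's formulation.

```python
# Generic stand-in for a combined reward over accuracy, temporal precision,
# and reasoning consistency; all terms and weights are assumptions.
def temporal_iou(pred, gold):
    """IoU of two (start_sec, end_sec) intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def combined_reward(answer_correct, pred_span, gold_span, trace_consistent,
                    w_acc=1.0, w_time=0.5, w_cons=0.5):
    r = w_acc * float(answer_correct)                  # answer accuracy
    r += w_time * temporal_iou(pred_span, gold_span)   # temporal precision
    r += w_cons * float(trace_consistent)              # trace/answer agreement
    return r

# e.g., combined_reward(True, (3.0, 8.0), (4.0, 9.0), True) -> 1.0 + 0.5*(2/3) + 0.5
```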
- When Thinking Drifts: Evidential Grounding for Robust Video Reasoning [68.75730050161219]
The Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks. However, CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues. Visual Evidence Reward (VER) is a reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence.
arXiv Detail & Related papers (2025-10-07T16:03:33Z)
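A hedged sketch of an evidence-grounding reward in the spirit of VER: reasoning traces earn reward only insofar as the segments they cite overlap annotated evidence. The coverage-style matching rule below is an assumption, not the paper's exact formulation.

```python
# Assumed evidence-grounding reward: fraction of annotated evidence spans
# that some cited span matches at IoU >= thresh.
def evidence_reward(cited_spans, evidence_spans, thresh=0.5):
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    if not evidence_spans:
        return 0.0
    hits = sum(any(iou(c, e) >= thresh for c in cited_spans)
               for e in evidence_spans)
    return hits / len(evidence_spans)

# e.g., evidence_reward([(3.0, 7.5)], [(4.0, 8.0), (20.0, 25.0)]) -> 0.5
```

A verbose monologue that cites no segments, or cites the wrong ones, scores zero here, which is the failure mode the summary says plain CoT drifts into.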
- TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding [28.79516973256083]
Temporal Video Grounding aims to precisely localize video segments corresponding to natural language queries. We propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG). TAR-TVG introduces timestamp anchors within the reasoning process to enforce explicit supervision of the thought content.
arXiv Detail & Related papers (2025-08-11T06:59:32Z)
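A small sketch of what timestamp anchors inside a reasoning trace make possible: once anchors are explicit tokens, they can be parsed out and supervised directly. The `<t=...s>` tag syntax is an illustrative assumption, not TAR-TVG's actual format.

```python
# Assumed anchor syntax: <t=12.5s>. Parsing anchors out of the generated
# trace lets a training loop supervise the thought content itself.
import re

ANCHOR = re.compile(r"<t=(\d+(?:\.\d+)?)s>")

def extract_anchors(trace_text, video_duration):
    """Return in-bounds timestamp anchors, in order of appearance."""
    times = [float(m.group(1)) for m in ANCHOR.finditer(trace_text)]
    return [t for t in times if 0.0 <= t <= video_duration]

trace = "The pour starts around <t=12.5s>, and the cup is full by <t=19s>."
anchors = extract_anchors(trace, video_duration=60.0)
assert anchors == [12.5, 19.0]  # anchors can supervise the predicted segment
```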
- A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. However, CoT reasoning that verbalizes intermediate steps limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z)
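A toy PyTorch sketch of the survey's core contrast: instead of verbalizing intermediate steps as tokens, iterate an update in the continuous hidden state and decode only at the end. The architecture below is illustrative, not any surveyed model.

```python
# Toy illustration of latent reasoning: multi-step inference happens in
# hidden space, so no intermediate tokens are ever emitted.
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, dim=64, latent_steps=4):
        super().__init__()
        self.step = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))
        self.latent_steps = latent_steps

    def forward(self, h):
        # Each iteration refines the state; the "reasoning bandwidth" is the
        # full continuous vector, not a discrete token per step.
        for _ in range(self.latent_steps):
            h = h + self.step(h)   # residual latent update
        return h

h = torch.randn(2, 64)             # e.g., pooled encoder states
refined = LatentReasoner()(h)      # decoded into an answer only at the end
```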
- VITED: Video Temporal Evidence Distillation [49.38292490256531]
We investigate complex video question answering via chain-of-evidence reasoning. Models struggle with multi-step reasoning as they uniformly sample a fixed number of frames. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains.
arXiv Detail & Related papers (2025-03-17T06:30:02Z)
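A minimal sketch of the sampling failure the summary points at: uniform sampling takes every k-th frame regardless of content, whereas evidence-weighted sampling concentrates the frame budget where a relevance model scores highly. The relevance source is left abstract here; VITED's distillation pipeline is not reproduced.

```python
# Evidence-weighted frame sampling vs. the uniform baseline; the relevance
# scores would come from any frame-relevance model (an assumption here).
import numpy as np

def sample_frames(relevance, budget, rng=None):
    """Pick frame indices with probability proportional to relevance."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(relevance, dtype=float)
    p = p / p.sum()
    return np.sort(rng.choice(len(p), size=budget, replace=False, p=p))

# Uniform sampling of 8 from 100 frames takes every ~12th frame and can
# miss a 10-frame evidence window entirely; weighting avoids that.
relevance = [0.1] * 40 + [1.0] * 10 + [0.1] * 50   # evidence at frames 40-49
print(sample_frames(relevance, budget=8))
```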
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
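A hedged sketch of the graph-guided self-training idea: mine compositional (question, answer, rationale) triples from temporally ordered graph events and use them as finetuning data. The event schema and question template are illustrative assumptions, not STEP's actual pipeline.

```python
# Assumed spatio-temporal graph schema: each event is a timed triple
# (t_start, t_end, subject, relation, object). Templates are illustrative.
from itertools import pairwise  # Python 3.10+

def mine_compositional_qa(events):
    """Turn consecutive events into multi-step (Q, A, rationale) samples."""
    samples = []
    for e1, e2 in pairwise(sorted(events)):   # consecutive in time
        q = f"What does {e2[2]} do after {e1[2]} {e1[3]} {e1[4]}?"
        rationale = (f"First ({e1[0]}-{e1[1]}s) {e1[2]} {e1[3]} {e1[4]}; "
                     f"then ({e2[0]}-{e2[1]}s) {e2[2]} {e2[3]} {e2[4]}.")
        samples.append({"question": q, "answer": f"{e2[3]} {e2[4]}",
                        "rationale": rationale})
    return samples

events = [(2, 5, "the man", "picks up", "a cup"),
          (6, 9, "the man", "pours", "water")]
print(mine_compositional_qa(events)[0]["question"])
```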