Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
- URL: http://arxiv.org/abs/2603.02872v1
- Date: Tue, 03 Mar 2026 11:24:55 GMT
- Title: Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
- Authors: Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen
- Abstract summary: Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning.
- Score: 14.21980212001207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS
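As a rough, non-authoritative illustration of two mechanisms the abstract names, the Python sketch below shows one way a streaming attention mask and a dual KV-cache could fit together. Everything here is an assumption for illustration: the function and class names are invented, and the masking rule is inferred from the abstract rather than taken from the released TaYS code (see the repository linked above for the real implementation).

```python
# Minimal sketch, NOT the TaYS implementation: names and the masking rule
# are assumptions inferred from the abstract.
import torch

def build_streaming_mask(frame_of_vis_token: list[int],
                         step_of_text_token: list[int]) -> torch.Tensor:
    """Boolean attention mask over [visual tokens | text tokens] (True = may attend).

    frame_of_vis_token[i]: index of the frame that produced visual token i.
    step_of_text_token[t]: stream step at which reasoning token t is emitted.
    A reasoning token may only see visual tokens from frames that have
    already arrived at its step, plus earlier reasoning tokens (causal).
    """
    v = torch.tensor(frame_of_vis_token)
    s = torch.tensor(step_of_text_token)
    nv, nt = len(v), len(s)
    mask = torch.zeros(nv + nt, nv + nt, dtype=torch.bool)
    mask[:nv, :nv] = v.unsqueeze(0) <= v.unsqueeze(1)   # visual tokens see earlier/same frames
    mask[nv:, :nv] = v.unsqueeze(0) <= s.unsqueeze(1)   # text sees only already-arrived frames
    mask[nv:, nv:] = torch.ones(nt, nt, dtype=torch.bool).tril()  # causal over text
    return mask

class DualKVCache:
    """Two independent caches, so encoding a newly arrived frame extends the
    visual cache without invalidating already-computed reasoning states."""
    def __init__(self) -> None:
        self.visual_kv: list[torch.Tensor] = []  # grows as frames stream in
        self.text_kv: list[torch.Tensor] = []    # grows as reasoning tokens are generated

    def append_visual(self, kv: torch.Tensor) -> None:
        self.visual_kv.append(kv)

    def append_text(self, kv: torch.Tensor) -> None:
        self.text_kv.append(kv)
```

For instance, build_streaming_mask([0, 0, 1, 1], [0, 1]) lets the first reasoning token attend only to the two frame-0 tokens, while the second also sees frame 1; keeping the two caches separate is what would let frame encoding and token generation advance concurrently.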
Related papers
- TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning [39.81570843186615]
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. We propose TwiFF, the first large-scale, temporally grounded VCoT dataset, derived from 2.7 million video clips. We show that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks.
arXiv Detail & Related papers (2026-02-11T09:20:04Z)
- Rethinking Chain-of-Thought Reasoning for Videos [19.579424881079447]
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing. Recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. Motivated by empirical observations, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning.
arXiv Detail & Related papers (2025-12-10T13:05:55Z)
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
- StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA [60.86024022291499]
We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA. Our framework generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion (a hypothetical sketch of this kind of fusion appears after this list). This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference.
arXiv Detail & Related papers (2025-10-29T09:47:38Z)
- StreamingThinker: Large Language Models Can Think While Reading [14.54868327561777]
Large language models (LLMs) have demonstrated remarkable capabilities in chain-of-thought (CoT) reasoning. Inspired by the human ability to think while reading, we first design a streaming thinking paradigm for LLMs. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading.
arXiv Detail & Related papers (2025-10-20T07:27:37Z)
- Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z)
- FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers [63.788600404496115]
FullDiT2 is an efficient in-context conditioning framework for general controllability in both video generation and editing tasks. FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step.
arXiv Detail & Related papers (2025-06-04T17:57:09Z)
- LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding [29.586274567275012]
While it is commonly assumed that the latter two mismatches require frequent re-encoding, our findings indicate that re-encoding outputs is largely unnecessary. We introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes (one possible reading of this scheme is sketched after this list). Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes.
arXiv Detail & Related papers (2025-05-22T17:53:28Z)
- VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning [35.64831081829936]
Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. We show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.
arXiv Detail & Related papers (2025-05-18T14:14:35Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
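The StreamingCoT entry above mentions fusing per-second descriptions into semantic segments by similarity. The sketch below is a hypothetical rendering of that idea, not the paper's actual pipeline: it greedily merges consecutive per-second embeddings whenever neighbours stay above a cosine-similarity threshold (the threshold value and all names are assumed).

```python
# Hypothetical sketch of similarity-based segment fusion; the greedy rule,
# threshold, and names are assumptions, not the StreamingCoT pipeline.
import numpy as np

def fuse_segments(embs: np.ndarray, threshold: float = 0.85) -> list[tuple[int, int]]:
    """Merge consecutive per-second description embeddings into segments.

    embs: (T, d) array, one embedding per per-second description.
    Returns [start, end) second ranges; a segment closes whenever the
    cosine similarity between neighbouring seconds drops below threshold.
    """
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    segments, start = [], 0
    for t in range(1, len(embs)):
        if cos(embs[t - 1], embs[t]) < threshold:  # topic shift: close the segment
            segments.append((start, t))
            start = t
    segments.append((start, len(embs)))
    return segments
```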
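The group position encoding entry is similarly concrete enough for a worked example. One possible reading, sketched below, reserves a fixed block of position ids per stream chunk ("group"), so appending a new chunk never shifts earlier ids and earlier tokens never need re-encoding. The block size and every name here are assumptions, not the cited paper's scheme.

```python
# Hypothetical sketch of a group position encoding; the reserved-block scheme
# and all names are assumptions, not taken from the cited paper or its code.
def group_positions(group_lengths: list[int], block: int = 512) -> list[int]:
    """Give group g the position ids [g * block, g * block + len(g)).

    Earlier groups' ids are final the moment they are assigned, which is
    the property that avoids re-encoding when a new chunk arrives.
    """
    positions: list[int] = []
    for g, length in enumerate(group_lengths):
        assert length <= block, "group exceeds its reserved position block"
        positions.extend(range(g * block, g * block + length))
    return positions

# Three chunks of 3, 2, and 4 tokens:
print(group_positions([3, 2, 4], block=8))  # [0, 1, 2, 8, 9, 16, 17, 18, 19]
```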
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.