StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
- URL: http://arxiv.org/abs/2510.25332v1
- Date: Wed, 29 Oct 2025 09:47:38 GMT
- Title: StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA
- Authors: Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu
- Abstract summary: We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA. Our framework generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference.
- Score: 60.86024022291499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
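The similarity-fusion step described in the abstract (merging per-second dense descriptions into temporally-dependent segments) can be sketched as a greedy merge over caption embeddings. This is an illustrative assumption, not the authors' actual pipeline: the embedding source, the pairwise cosine criterion, and the 0.8 threshold are all hypothetical.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def fuse_segments(per_second_embeddings, threshold=0.8):
    """Greedily merge consecutive seconds whose description embeddings
    are similar, starting a new segment whenever similarity drops below
    the threshold. Returns a list of (start_sec, end_sec) tuples,
    end-inclusive. Threshold value is an illustrative assumption.
    """
    segments = []
    start = 0
    for t in range(1, len(per_second_embeddings)):
        if cosine(per_second_embeddings[t - 1], per_second_embeddings[t]) < threshold:
            segments.append((start, t - 1))  # close the current segment
            start = t                        # open a new one at time t
    segments.append((start, len(per_second_embeddings) - 1))
    return segments
```

For example, four seconds whose embeddings shift sharply between seconds 1 and 2 would fuse into two segments, `[(0, 1), (2, 3)]`.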
Related papers
- Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models [14.21980212001207]
Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning.
arXiv Detail & Related papers (2026-03-03T11:24:55Z) - Causality-Aware Temporal Projection for Video Understanding in Video-LLMs [14.297733965389959]
V-CORE is a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy.
arXiv Detail & Related papers (2026-01-05T05:30:13Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - Causality Matters: How Temporal Information Emerges in Video Language Models [17.570777893613137]
We find that removing or modifying positional encodings in video inputs yields minimal degradation in the performance of temporal understanding. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation.
arXiv Detail & Related papers (2025-08-15T16:33:14Z) - LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos [59.391265901911005]
We propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address complex challenges through temporal-semantic reasoning. CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and selects for each object a corresponding frame in which it can be observed most clearly among all frames (temporal). The framework's training-free nature further allows it to process online video streams, where the CoT is applied at test time to update the object of interest when a better target emerges.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction [18.24629930062925]
Partially Relevant Video Retrieval aims to retrieve the target video that is partially relevant to a text query. Existing methods coarsely align paired videos and text queries to construct the semantic space. We propose a novel PRVR framework to systematically exploit inter-sample correlation and intra-sample redundancy.
arXiv Detail & Related papers (2025-04-28T09:52:46Z) - V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning [40.18308199837137]
We introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. We construct a dataset to elicit the spatio-temporal reasoning process of Video-LLMs. Experiments on 14 Video-LLMs reveal significant gaps between current Video-LLMs and the requirements for robust and consistent reasoning.
arXiv Detail & Related papers (2025-03-14T15:21:44Z) - HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z) - TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions. We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category. We assess the VideoQA model's temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)