ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
- URL: http://arxiv.org/abs/2507.09876v1
- Date: Mon, 14 Jul 2025 03:21:13 GMT
- Title: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
- Authors: Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, Libo Qin,
- Abstract summary: Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning. We introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT). We show that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm.
- Score: 50.42183477287337
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.
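To make the interleaved paradigm concrete, the sketch below shows one way selected key frames could be placed between textual reasoning steps in a single multimodal prompt, instead of supplying all frames once up front as in text-only CoT. It is a minimal illustration, not the paper's implementation: it assumes an OpenAI-compatible multimodal chat API and OpenCV for frame extraction, and the model name, hard-coded key-frame indices, and prompt wording are placeholders rather than the ViTIB pipeline.

```python
# Minimal sketch of a video-text interleaved CoT prompt (illustrative only).
# Assumptions: an OpenAI-compatible chat endpoint, OpenCV for frame grabs,
# and hand-picked key-frame indices standing in for ViTIB's MLLM-based
# key-video selection.
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai


def encode_frame(video_path: str, frame_idx: int) -> str:
    """Read one frame from the video and return it as a base64-encoded JPEG."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"could not read frame {frame_idx}")
    ok, buf = cv2.imencode(".jpg", frame)
    return base64.b64encode(buf.tobytes()).decode()


def interleaved_cot_messages(video_path: str, question: str, key_frames: list[int]):
    """Build one user message whose content alternates text reasoning cues
    with key frames, so the model re-examines visual evidence at each step."""
    content = [{"type": "text",
                "text": f"Question: {question}\nReason step by step, "
                        f"re-examining each provided key frame before answering."}]
    for step, idx in enumerate(key_frames, start=1):
        content.append({"type": "text",
                        "text": f"Reasoning step {step}: inspect the next key frame."})
        content.append({"type": "image_url",
                        "image_url": {"url": "data:image/jpeg;base64,"
                                             + encode_frame(video_path, idx)}})
    content.append({"type": "text", "text": "Now give the final answer."})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    client = OpenAI()  # requires OPENAI_API_KEY in the environment
    # Key-frame indices are hard-coded purely for illustration; in the paper
    # they would come from MLLM-based selection with manual verification.
    messages = interleaved_cot_messages("demo.mp4",
                                        "What caused the car to stop?",
                                        key_frames=[12, 48, 96])
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(reply.choices[0].message.content)
```

The design point is the message structure: frames reappear at each reasoning step rather than being consumed only once at the start, which is the interleaving the abstract contrasts with text-only CoT.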
Related papers
- SiLVR: A Simple Language-based Video Reasoning Framework [71.77141065418238]
We present SiLVR, a Simple Language-based Video Reasoning framework. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks.
arXiv Detail & Related papers (2025-05-30T17:59:19Z) - ViSpeak: Visual Instruction Feedback in Streaming Videos [50.99067964073338]
We propose a novel task named Visual Instruction Feedback, in which models should be aware of visual contents and learn to extract instructions from them. We also propose the ViSpeak model, a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks.
arXiv Detail & Related papers (2025-03-17T03:05:31Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models [39.61402609070949]
Video Affective Reasoning (VAR) aims to predict and explain how a video would make a human feel. We propose StimuVAR, a spatiotemporal stimuli-aware framework for Video Affective Reasoning with Multimodal Large Language Models (MLLMs). We show that StimuVAR is superior to existing MLLMs in understanding viewers' emotional responses to videos and providing coherent and insightful explanations.
arXiv Detail & Related papers (2024-08-31T00:00:50Z) - FILS: Self-Supervised Video Feature Prediction In Semantic Language Space [11.641926922266347]
This paper demonstrates a self-supervised approach for learning semantic video representations.
We present FILS, a novel self-supervised video Feature prediction In semantic Language Space.
arXiv Detail & Related papers (2024-06-05T16:44:06Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Let's Think Frame by Frame with VIP: A Video Infilling and Prediction
Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)