Video Finetuning Improves Reasoning Between Frames
- URL: http://arxiv.org/abs/2511.12868v1
- Date: Mon, 17 Nov 2025 01:51:57 GMT
- Title: Video Finetuning Improves Reasoning Between Frames
- Authors: Ruiqi Yang, Tian Yun, Zihan Wang, Ellie Pavlick
- Abstract summary: We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering. We find that video models transfer this temporal reasoning ability to purely static settings, outperforming image-model baselines on visual reasoning tasks.
- Score: 23.676284017808218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image models' baselines on relational visual reasoning tasks.
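The abstract describes vCoT as interleaving per-frame information with generated descriptions of the transitional events between consecutive frames, which are then supplied as extra context for long-form video QA. A minimal sketch of that pipeline is below; the `describe_transition` stub and all function names are hypothetical stand-ins for a real LVLM call, not the paper's actual implementation.

```python
from typing import Callable, List


def describe_transition(frame_a: str, frame_b: str) -> str:
    """Stub standing in for an LVLM prompt that infers the event
    occurring between two consecutive frames (hypothetical API)."""
    return f"Transition: from [{frame_a}] to [{frame_b}]"


def visual_chain_of_thought(
    frame_captions: List[str],
    transition_fn: Callable[[str, str], str] = describe_transition,
) -> List[str]:
    """Interleave frame captions with generated transitional cues,
    in the spirit of the vCoT idea described in the abstract."""
    chain: List[str] = []
    for i, caption in enumerate(frame_captions):
        chain.append(caption)
        if i + 1 < len(frame_captions):
            # One transitional description per consecutive frame pair.
            chain.append(transition_fn(caption, frame_captions[i + 1]))
    return chain


frames = ["a man picks up an empty cup", "the cup is full of coffee"]
chain = visual_chain_of_thought(frames)
# `chain` now holds 2 captions plus 1 transitional cue, ready to be
# concatenated into the QA prompt given to an image-only model.
```

Under this reading, an image-only model sees the transitional cues explicitly, while a video-finetuned model is hypothesized to capture the same frame-to-frame information implicitly.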
Related papers
- Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z) - Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning [37.86612817818566]
We propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. This, in turn, leads to improved performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2025-05-31T00:08:21Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction [43.16308241800144]
We introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. We establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101.
arXiv Detail & Related papers (2024-12-06T10:34:50Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Multi-entity Video Transformers for Fine-Grained Video Representation Learning [34.26732761916984]
We re-examine the design of transformer architectures for video representation learning. A key aspect of our approach is the improved sharing of scene information in the temporal pipeline. Our Multi-entity Video Transformer (MV-Former) processes the frames as groups of entities represented as tokens linked across time.
arXiv Detail & Related papers (2023-11-17T21:23:12Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.