Video Understanding: Through A Temporal Lens
- URL: http://arxiv.org/abs/2602.00683v1
- Date: Sat, 31 Jan 2026 12:01:09 GMT
- Title: Video Understanding: Through A Temporal Lens
- Authors: Thong Thanh Nguyen
- Abstract summary: This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. The work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers for efficient long-form video modeling; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models that identifies the visual-language interface as a bottleneck for temporal reasoning.
- Score: 5.153774021264937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model's ability to represent and reason about the fluid nature of video content.
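Contribution (1) hinges on a subtractive angular margin: unlike additive-margin losses (e.g., ArcFace), which add a margin to the positive-pair angle to make the task harder, subtracting the margin relaxes the positive constraint, so noisy automatically generated annotations are penalized less harshly. As a rough illustration only (the thesis's actual formulation is not given here), a minimal PyTorch sketch of such an objective might look like the following; the function name, margin, and temperature are all illustrative assumptions:

```python
# A minimal sketch (not the thesis's code) of a contrastive objective with a
# SUBTRACTIVE angular margin: the margin is subtracted from the positive-pair
# angle, softening the constraint so noisy auto-annotated positives hurt less.
# `margin` and `temperature` values are illustrative assumptions.
import torch
import torch.nn.functional as F

def subtractive_margin_nce(video_emb, text_emb, margin=0.1, temperature=0.07):
    """video_emb, text_emb: (B, D); row i of each is a (possibly noisy) positive pair."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    cos_sim = v @ t.T                                       # (B, B) cosine similarities
    # Recover angles, then RELAX the diagonal (positive) angles by the margin.
    theta = torch.acos(cos_sim.clamp(-1 + 1e-7, 1 - 1e-7))  # clamp avoids NaN at |cos| = 1
    idx = torch.arange(v.size(0), device=v.device)
    logits = cos_sim / temperature
    logits[idx, idx] = torch.cos(theta[idx, idx] - margin) / temperature
    return F.cross_entropy(logits, idx)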
Related papers
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset for fine-tuning that encourages the model's responses to focus on the given input video.
We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
- Kwai Keye-VL 1.5 Technical Report [91.07838286692815]
We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations.
First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity.
Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens.
Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment.
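The first innovation is concrete enough to sketch: frames that closely resemble their predecessor carry little new information, so they can be given a smaller token budget. A hypothetical minimal version of such similarity-driven allocation follows; the threshold, pooling factor, and function names are assumptions, not the report's actual design:

```python
# A hypothetical sketch of similarity-driven Slow-Fast token budgeting: frames
# that differ strongly from their predecessor keep their full patch-token
# budget ("fast"); near-duplicate frames are pooled down ("slow"). Threshold,
# pooling factor, and names are assumptions, not Keye-VL-1.5's code.
import torch
import torch.nn.functional as F

def slow_fast_budget(frame_tokens, sim_threshold=0.9, slow_pool=4):
    """frame_tokens: (T, N, D) patch tokens per frame; N must be divisible by slow_pool."""
    frame_vecs = F.normalize(frame_tokens.mean(dim=1), dim=-1)    # (T, D) frame summaries
    sims = (frame_vecs[1:] * frame_vecs[:-1]).sum(dim=-1)         # (T-1,) inter-frame cosine
    is_fast = torch.cat([torch.ones(1, dtype=torch.bool, device=sims.device),
                         sims < sim_threshold])                   # first frame is always fast
    out = []
    for t in range(frame_tokens.size(0)):
        toks = frame_tokens[t]
        if not is_fast[t]:   # "slow" frame: average-pool tokens to cut downstream compute
            toks = toks.reshape(-1, slow_pool, toks.size(-1)).mean(dim=1)
        out.append(toks)
    return out  # ragged list: fast frames keep N tokens, slow frames N // slow_pool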
arXiv Detail & Related papers (2025-09-01T15:46:58Z)
- DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment [17.85550556489256]
This paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA).
A Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream.
A Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling.
Finally, a text-guided adaptive fusion strategy is proposed to enable more effective integration of spatial and temporal information.
arXiv Detail & Related papers (2025-04-16T03:20:28Z)
- Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding [23.477954901326978]
Existing approaches adopt either implicit temporal modeling, relying solely on the decoder, or explicit temporal modeling, employing auxiliary temporal encoders.
We propose the Stackable Temporal Encoder (STE) to enable flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios.
Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
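To make the two knobs concrete, a rough sketch of what an explicit temporal encoder block with token compression could look like is given below: temporal self-attention across frames, followed by strided pooling that reduces the frame count. This is an assumption-based illustration of the general idea, not the paper's STE implementation:

```python
# An illustrative (assumed, not the paper's) temporal encoder block: temporal
# self-attention lets frame tokens exchange information, and a strided 1D conv
# compresses T frames to T // compress_ratio. Stacking blocks grows the
# effective temporal receptive field while shrinking the token count.
import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=8, compress_ratio=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.compress = nn.Conv1d(dim, dim, kernel_size=compress_ratio, stride=compress_ratio)

    def forward(self, x):                       # x: (B, T, D), one token per frame
        h = self.norm(x)
        h, _ = self.attn(h, h, h)               # temporal self-attention across frames
        x = x + h                               # residual connection
        return self.compress(x.transpose(1, 2)).transpose(1, 2)  # (B, T // ratio, D)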
arXiv Detail & Related papers (2025-01-28T08:30:58Z)
- Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models [44.99833362998488]
Temporal Semantic Alignment via Dynamic Prompting (TSADP) is a novel framework that enhances temporal reasoning capabilities.
We evaluate TSADP on the VidSitu dataset, augmented with enriched temporal annotations.
Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
arXiv Detail & Related papers (2024-12-16T02:37:58Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote the learning of region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures pre-trained on Kinetics-400.
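Synthesizing videos from a model's internal representations generally means optimizing an input, rather than the weights, so that an intermediate activation matches a target. A generic sketch of that family of techniques follows; the regularizer, shapes, and hyperparameters are illustrative assumptions and not LEAPS itself:

```python
# A generic sketch of feature inversion for video (the family LEAPS belongs
# to): start from noise and run gradient descent ON THE INPUT so a frozen
# model's internal features match a target. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def invert_features(model, target_feats, shape=(1, 3, 16, 224, 224), steps=200, lr=0.05):
    """model: frozen callable mapping a video tensor to features shaped like target_feats."""
    video = torch.randn(shape, requires_grad=True)    # (B, C, T, H, W) noise initialization
    opt = torch.optim.Adam([video], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(video), target_feats)
        loss = loss + 1e-4 * video.diff(dim=2).abs().mean()  # temporal-smoothness prior
        loss.backward()
        opt.step()
    return video.detach()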
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
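The recipe being revisited here is commonly instantiated as a frozen CLIP image encoder per frame plus a light temporal head on top. A minimal sketch of such a head is below; the module sizes and names are assumptions for illustration, not the paper's mechanism:

```python
# A minimal, assumed sketch of temporal modeling over frozen CLIP frame
# embeddings: a single temporal transformer layer lets frames interact, then
# mean pooling yields a video-level embedding for downstream video tasks.
import torch
import torch.nn as nn

class ClipTemporalHead(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, frame_feats):             # (B, T, D) frozen CLIP frame embeddings
        ctx = self.temporal(frame_feats)        # frames exchange temporal information
        return ctx.mean(dim=1)                  # (B, D) video-level embedding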
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
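Order prediction is the standard way to mint such supervisory signals: shuffle the snippets and train a classifier to recover the permutation, which forces the backbone to encode temporal structure. A simplified, hypothetical sketch is shown below; the adaptive snippet selection of the actual ASOP module is omitted:

```python
# A simplified, hypothetical sketch of snippet-order self-supervision in the
# spirit of ASOP: shuffle three snippet features, then classify which of the
# 3! = 6 permutations was applied. No human labels are needed.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

PERMS = list(itertools.permutations(range(3)))   # the 6 orderings of 3 snippets

class OrderPredictionHead(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.classifier = nn.Linear(3 * feat_dim, len(PERMS))

    def forward(self, snippet_feats):            # (B, 3, D) features of 3 video snippets
        b = snippet_feats.size(0)
        labels = torch.randint(len(PERMS), (b,), device=snippet_feats.device)
        shuffled = torch.stack([snippet_feats[i, list(PERMS[labels[i]])]
                                for i in range(b)])      # apply the sampled permutations
        logits = self.classifier(shuffled.flatten(1))    # (B, 6) permutation logits
        return F.cross_entropy(logits, labels)           # self-supervised training signal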
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization [30.670109727802494]
This paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations.
Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
arXiv Detail & Related papers (2021-08-04T17:16:18Z)