Factorized Learning for Temporally Grounded Video-Language Models
- URL: http://arxiv.org/abs/2512.24097v1
- Date: Tue, 30 Dec 2025 09:13:20 GMT
- Title: Factorized Learning for Temporally Grounded Video-Language Models
- Authors: Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng
- Abstract summary: Two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy. We propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency.
- Score: 81.13591807802652
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.
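The abstract only describes factorized preference optimization (FPO) at a high level. As a rough, hedged illustration, the sketch below shows what a DPO-style preference loss extended with a temporal-grounding log-likelihood term might look like; the function name, the weighting scheme, and how the grounding log-probabilities are obtained are assumptions made for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch: a DPO-style preference objective with an added
# temporal-grounding term, illustrating the "factorized" idea in the abstract.
# All names and the weighting (beta, lam) are assumptions, not the paper's FPO.
import torch
import torch.nn.functional as F

def factorized_preference_loss(
    logp_resp_chosen, logp_resp_rejected,          # policy log-probs of textual responses
    ref_logp_resp_chosen, ref_logp_resp_rejected,  # frozen reference-model log-probs
    logp_ground_chosen, logp_ground_rejected,      # policy log-probs of temporal grounding
    ref_logp_ground_chosen, ref_logp_ground_rejected,
    beta=0.1, lam=1.0,
):
    # Textual-response margin (standard DPO-style log-ratio difference).
    resp_margin = (logp_resp_chosen - ref_logp_resp_chosen) \
        - (logp_resp_rejected - ref_logp_resp_rejected)
    # Temporal-grounding margin: log-likelihood ratios of the grounded
    # segments under the policy vs. the reference model.
    ground_margin = (logp_ground_chosen - ref_logp_ground_chosen) \
        - (logp_ground_rejected - ref_logp_ground_rejected)
    # Both factors enter a single sigmoid margin, so the model is rewarded
    # for preferring better grounding and better answers jointly.
    return -F.logsigmoid(beta * (resp_margin + lam * ground_margin)).mean()

if __name__ == "__main__":
    # Toy batch of 4 preference pairs with random log-probabilities.
    rand = lambda: torch.randn(4)
    loss = factorized_preference_loss(rand(), rand(), rand(), rand(),
                                      rand(), rand(), rand(), rand())
    print(loss.item())
```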
Related papers
- Enrich and Detect: Video Temporal Grounding with Multimodal LLMs [60.224522472631776]
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multimodal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings.
arXiv Detail & Related papers (2025-10-19T22:12:45Z) - Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp to create a targeted synthetic temporal dataset for fine-tuning the model's responses, encouraging it to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z) - VideoExplorer: Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training
Framework for Temporal Grounding [20.185272219985787]
Temporal grounding aims to locate a target video moment that semantically corresponds to the given sentence query in an untrimmed video.
Previous methods do not reason about target moment locations based on visual-textual semantic alignment, but instead over-rely on the temporal biases of queries in training sets.
This paper proposes a novel training framework for grounding models that uses shuffled videos to address the temporal bias problem without losing grounding accuracy.
arXiv Detail & Related papers (2022-07-29T14:11:48Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video
Representation [16.643709221279764]
We propose a novel pretext task: spatio-temporal overlap rate (STOR) prediction.
It stems from the observation that humans are capable of discriminating the overlap rates of videos in space and time.
We employ a joint task combining STOR prediction with contrastive learning to further enhance spatio-temporal representation learning.
arXiv Detail & Related papers (2021-12-16T14:31:22Z)