Related papers: TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

URL: http://arxiv.org/abs/2512.03963v2
Date: Thu, 04 Dec 2025 02:51:15 GMT
Title: TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
Authors: Tao Wu, Li Yang, Gen Zhan, Yabin Zhang, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang,
Abstract summary: Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis.<n>We present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension.
Score: 25.848638804759872
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.

Related papers

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline [58.585692088008905]
MM-Lifelong is a dataset designed for Multimodal Lifelong Understanding.<n>Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities.
arXiv Detail & Related papers (2026-03-05T18:52:12Z)
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs.<n>We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z)
From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning [12.903267405917388]
We propose MADI, a multi-modal large language model (MLLM) enhanced with fine-grained alignment and disentangled interaction.<n>Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.
arXiv Detail & Related papers (2026-01-29T09:13:46Z)
Enhancing Temporal Awareness in LLMs for Temporal Point Processes [53.596733432865626]
Temporal point processes (TPPs) are crucial for analyzing events over time.<n> TPP-TAL is a novel plug-and-play framework designed to enhance temporal reasoning within large language models.<n> TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy.
arXiv Detail & Related papers (2025-12-29T03:01:24Z)
Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning [2.426309874608745]
Complex numerical time series analysis often demands multi-step reasoning capabilities beyond current models' reach.<n>We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains large language models to perform Chain-of-Thought (CoT) reasoning across diverse time series tasks using reinforcement learning (RL) with verifiable rewards.<n>Our experiments demonstrate that this RL-driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.
arXiv Detail & Related papers (2025-10-01T17:02:28Z)
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs [5.074812070492738]
We introduce DaMO, a data-efficient Video LLM specifically designed for accurate temporal reasoning and multimodal understanding.<n>We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities.<n>Our work establishes a promising direction for data-efficient video-language modeling.
arXiv Detail & Related papers (2025-06-13T08:13:05Z)
Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition [43.84348967231349]
Few-shot action recognition aims to recognize novel action categories with few exemplars.<n>Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies.<n>We propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR.
arXiv Detail & Related papers (2025-04-14T10:23:22Z)
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce bfUnistage, a unified Transformer-based framework fortemporal modeling.<n>Our work demonstrates that a task-specific vision-text can build a generalizable model fortemporal learning.<n>We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
LLM-PS: Empowering Large Language Models for Time Series Forecasting with Temporal Patterns and Semantics [56.99021951927683]
Time Series Forecasting (TSF) is critical in many real-world domains like financial planning and health monitoring.<n>Existing Large Language Models (LLMs) usually perform suboptimally because they neglect the inherent characteristics of time series data.<n>We propose LLM-PS to empower the LLM for TSF by learning the fundamental textitPatterns and meaningful textitSemantics from time series data.
arXiv Detail & Related papers (2025-03-12T11:45:11Z)
TempoGPT: Enhancing Time Series Reasoning via Quantizing Embedding [13.996105878417204]
We propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT.<n>We construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system.<n>Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art in the constructed complex time series reasoning tasks.
arXiv Detail & Related papers (2025-01-13T13:47:05Z)
Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning [22.28251586213348]
aLLM4TS is an innovative framework that adapts Large Language Models (LLMs) for time-series representation learning. A distinctive element of our framework is the patch-wise decoding layer, which departs from previous methods reliant on sequence-level decoding.
arXiv Detail & Related papers (2024-02-07T13:51:26Z)
Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge. We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
USTEP: Spatio-Temporal Predictive Learning under A Unified View [62.58464029270846]
We introduce USTEP (Unified S-TEmporal Predictive learning), an innovative framework that reconciles the recurrent-based and recurrent-free methods by integrating both micro-temporal and macro-temporal scales.
arXiv Detail & Related papers (2023-10-09T16:17:42Z)
Interpretable Time-series Representation Learning With Multi-Level Disentanglement [56.38489708031278]
Disentangle Time Series (DTS) is a novel disentanglement enhancement framework for sequential data. DTS generates hierarchical semantic concepts as the interpretable and disentangled representation of time-series. DTS achieves superior performance in downstream applications, with high interpretability of semantic concepts.
arXiv Detail & Related papers (2021-05-17T22:02:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.