Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
- URL: http://arxiv.org/abs/2603.05484v1
- Date: Thu, 05 Mar 2026 18:52:12 GMT
- Title: Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
- Authors: Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu
- Abstract summary: MM-Lifelong is a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities.
- Score: 58.585692088008905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
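To make the recursive belief-state idea concrete, here is a minimal Python sketch of a chunk-by-chunk agent loop; the `RecursiveAgent` class, its caption inputs, and the compression heuristic are illustrative assumptions, since the abstract does not specify ReMA's memory-management policy.

```python
from dataclasses import dataclass, field

# Hypothetical sketch in the spirit of ReMA: the video is consumed chunk by
# chunk, and a bounded textual "belief state" is rewritten (not just appended
# to) at every step.
MAX_BELIEF_CHARS = 4000  # assumed budget standing in for the model's context limit

@dataclass
class RecursiveAgent:
    belief: str = ""                              # running summary of everything seen so far
    history: list = field(default_factory=list)   # raw chunk captions, for inspection

    def observe(self, chunk_caption: str) -> None:
        """Fold one chunk's caption into the recursive belief state."""
        merged = (self.belief + "\n" + chunk_caption).strip()
        if len(merged) > MAX_BELIEF_CHARS:
            # Dynamic memory management: compress rather than truncate, so old
            # but relevant facts can survive month-long timelines.
            merged = self.compress(merged)
        self.belief = merged
        self.history.append(chunk_caption)

    def compress(self, text: str) -> str:
        # Placeholder for an MLLM summarization call; keeping the most recent
        # half is a crude stand-in, not ReMA's actual policy.
        return text[-MAX_BELIEF_CHARS // 2:]

agent = RecursiveAgent()
for caption in ["Day 1: user cooks breakfast.", "Day 3: user visits a clinic."]:
    agent.observe(caption)
print(agent.belief)
```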
Related papers
- UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting [90.47915032778366]
We propose UniDiff, a unified diffusion framework for multimodal time series forecasting. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism integrates structural information from timestamps and semantic context from texts. Experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
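As a concrete illustration of that fusion, here is a minimal PyTorch sketch of a single cross-attention call in which series tokens attend to concatenated timestamp and text embeddings; tensor names and dimensions are assumptions, not UniDiff's API.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not UniDiff's code): one cross-attention layer lets
# time-series tokens attend jointly to timestamp and text embeddings.
d_model = 64
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

ts_tokens = torch.randn(2, 96, d_model)      # (batch, series length, dim)
timestamp_emb = torch.randn(2, 96, d_model)  # structural info from timestamps
text_emb = torch.randn(2, 12, d_model)       # semantic context from texts

# A single attention call integrates both conditioning sources at once,
# mirroring the "unified and parallel fusion" idea in the summary.
context = torch.cat([timestamp_emb, text_emb], dim=1)
fused, _ = attn(query=ts_tokens, key=context, value=context)
print(fused.shape)  # torch.Size([2, 96, 64])
```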
arXiv Detail & Related papers (2025-12-08T05:36:14Z)
- GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory [59.869552603264076]
We introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory, which structurally models events and their causal and temporal relations into a concise, organized context. Experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline.
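The episodic memory can be pictured as a small event graph that is linearized into a compact prompt context. The sketch below is a hypothetical rendering of that idea; the field names and linearization format are assumptions, not GCAgent's schema.

```python
from dataclasses import dataclass, field

# Hypothetical "schematic and narrative episodic memory": events are stored
# as nodes with temporal order and causal links, then rendered into a concise
# context string an MLLM could consume.

@dataclass
class Event:
    t_start: float
    t_end: float
    narrative: str
    causes: list = field(default_factory=list)  # ids of causally prior events

@dataclass
class EpisodicMemory:
    events: list = field(default_factory=list)

    def add(self, event: Event) -> int:
        self.events.append(event)
        return len(self.events) - 1  # event id

    def as_context(self) -> str:
        """Linearize events in temporal order, annotating causal links."""
        lines = []
        for e in sorted(self.events, key=lambda ev: ev.t_start):
            links = f" (caused by events {e.causes})" if e.causes else ""
            lines.append(f"[{e.t_start:.0f}-{e.t_end:.0f}s] {e.narrative}{links}")
        return "\n".join(lines)

mem = EpisodicMemory()
chop = mem.add(Event(0, 30, "chef chops vegetables"))
mem.add(Event(40, 90, "vegetables go into the pan", causes=[chop]))
print(mem.as_context())
```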
arXiv Detail & Related papers (2025-11-15T04:29:00Z)
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility [9.200793414310182]
We introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework for predicting human mobility. We use large language models (LLMs) as general-purpose predictors and reasoners. RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time.
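As a toy illustration of hierarchical temporal tokenization, the sketch below groups hourly location IDs into one day-level token each; the granularities, the frequency-based summarization, and the weekday/weekend flag are assumptions, not RHYTHM's actual tokenizer.

```python
# Toy sketch: hourly location IDs are compressed into day-level tokens that
# an LLM-style predictor could then consume as a short sequence.

def tokenize_trajectory(hourly_locs, hours_per_day=24):
    """Group hourly location IDs into one compact token per day."""
    days = [hourly_locs[i:i + hours_per_day]
            for i in range(0, len(hourly_locs), hours_per_day)]
    tokens = []
    for day_idx, day in enumerate(days):
        # One summary token per day: the day's most frequent location plus a
        # coarse weekday/weekend flag a downstream reasoner could exploit.
        top = max(set(day), key=day.count)
        kind = "weekend" if day_idx % 7 in (5, 6) else "weekday"
        tokens.append(f"<{kind}:loc{top}>")
    return tokens

week = [h % 3 for h in range(24 * 7)]  # synthetic hourly location IDs
print(tokenize_trajectory(week))       # seven day-level tokens
```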
arXiv Detail & Related papers (2025-09-27T04:55:56Z)
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. In continual learning settings, VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding [18.027290155746112]
Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long-video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the model's focus by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
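The observation above suggests a simple search loop: score candidate sub-intervals by answer confidence and recurse into the best one. The sketch below assumes a hypothetical `confidence` oracle in place of an MLLM call and is not the paper's implementation.

```python
import random

def confidence(interval):
    """Stand-in for scoring how confidently the model answers on an interval."""
    lo, hi = interval
    random.seed(int(lo * 1000 + hi))  # deterministic placeholder score
    return random.random()

def temporal_search(duration, steps=4, branches=4):
    """Iteratively keep the sub-interval with the highest answer confidence."""
    lo, hi = 0.0, duration
    for _ in range(steps):
        width = (hi - lo) / branches
        candidates = [(lo + i * width, lo + (i + 1) * width)
                      for i in range(branches)]
        lo, hi = max(candidates, key=confidence)  # zoom into the best branch
    return lo, hi

print(temporal_search(duration=3600.0))  # narrows a 1-hour video to ~14 s
```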
arXiv Detail & Related papers (2025-06-28T15:24:05Z)
- DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs [5.074812070492738]
We introduce DaMO, a data-efficient Video LLM specifically designed for accurate temporal reasoning and multimodal understanding. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. Our work establishes a promising direction for data-efficient video-language modeling.
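A schematic of such a progressive curriculum is sketched below; the first three stage names follow the abstract, but the assumed fourth stage, the freeze flags, and the training routine are placeholders rather than DaMO's pipeline.

```python
# Schematic of a four-stage progressive curriculum: each stage starts from
# the previous stage's weights, adding one capability at a time instead of
# training everything at once.
STAGES = [
    ("multimodal_alignment", {"freeze_llm": True}),
    ("semantic_grounding",   {"freeze_llm": True}),
    ("temporal_reasoning",   {"freeze_llm": False}),
    ("instruction_tuning",   {"freeze_llm": False}),  # assumed final stage
]

def train_one_stage(name, config):
    # Placeholder for a real training loop over stage-specific data.
    print(f"stage={name} freeze_llm={config['freeze_llm']}")

for name, config in STAGES:
    train_one_stage(name, config)
```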
arXiv Detail & Related papers (2025-06-13T08:13:05Z)
- HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics [32.117677036812836]
This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics. Its two versatile modules can enhance existing video-language models or operate as a standalone system. HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-08-30T17:52:55Z)
- Toward Time-Continuous Data Inference in Sparse Urban CrowdSensing [5.105223708885987]
Mobile Crowd Sensing (MCS) is a promising paradigm that leverages mobile users and their smart portable devices to perform various real-world tasks.
Sparse MCS has emerged as a more practical alternative, collecting data from a limited number of target subareas and utilizing inference algorithms to complete the full sensing map.
In this paper, we move beyond fine-grained completion, i.e., the subdivision of sensing cycles into minimal time units, toward a more accurate, time-continuous completion.
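As a toy stand-in for such inference, the sketch below completes a sparse area-by-cycle sensing matrix with an iterative rank-1 SVD fit, then interpolates between cycles to answer time-continuous queries; the algorithm and dimensions are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Toy illustration: observed (subarea, cycle) cells are known, the rest are
# inferred, and interpolation sketches the step toward time-continuity.
rng = np.random.default_rng(0)
truth = np.outer(rng.uniform(20, 30, 8), rng.uniform(0.9, 1.1, 24))  # 8 areas x 24 cycles
mask = rng.random(truth.shape) < 0.3  # only ~30% of cells are sensed

filled = np.where(mask, truth, truth[mask].mean())  # init unknowns with the mean
for _ in range(50):
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = s[0] * np.outer(u[:, 0], vt[0])  # rank-1 approximation
    filled = np.where(mask, truth, low_rank)    # keep observed cells fixed

def value_at(area, t):
    """Query between cycles: linear interpolation toward time-continuity."""
    t0 = int(np.floor(t))
    t1 = min(t0 + 1, filled.shape[1] - 1)
    w = t - t0
    return (1 - w) * filled[area, t0] + w * filled[area, t1]

print(round(value_at(3, 10.5), 2))  # estimate for subarea 3 between cycles 10 and 11
```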
arXiv Detail & Related papers (2024-08-27T19:25:41Z)
- A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language. To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates. We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for multivariate time series classification.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
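The multi-scale idea can be sketched as alternating strided convolutions and transformer layers, so self-attention runs over progressively shorter sequences; the PyTorch sketch below illustrates that pattern under assumed dimensions and is not FormerTime's architecture.

```python
import torch
import torch.nn as nn

# Illustrative pattern: a strided convolution shrinks the sequence between
# transformer stages, easing self-attention's quadratic cost.
class Stage(nn.Module):
    def __init__(self, dim, stride):
        super().__init__()
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=stride, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                                 # x: (batch, length, dim)
        x = self.down(x.transpose(1, 2)).transpose(1, 2)  # conv halves the length
        return self.encoder(x)                            # attention on coarser tokens

model = nn.Sequential(Stage(32, 2), Stage(32, 2), Stage(32, 2))
x = torch.randn(4, 256, 32)  # multivariate series projected to dim 32
print(model(x).shape)        # torch.Size([4, 32, 32]): 256 steps -> 32
```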
arXiv Detail & Related papers (2023-02-20T07:46:14Z)