EgoLCD: Egocentric Video Generation with Long Context Diffusion
- URL: http://arxiv.org/abs/2512.04515v1
- Date: Thu, 04 Dec 2025 06:53:01 GMT
- Title: EgoLCD: Egocentric Video Generation with Long Context Diffusion
- Authors: Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang,
- Abstract summary: EgoLCD is an end-to-end framework for egocentric long-context video generation. It combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory. EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency.
- Score: 11.039806330368153
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
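The abstract sketches a two-tier memory: a Long-Term Sparse KV Cache for global context plus an attention-based short-term memory over the current chunk. Below is a minimal, hedged sketch of how such a scheme could be wired into chunk-wise autoregressive generation; all class and function names, the attention-mass eviction rule, and the tensor shapes are illustrative assumptions rather than the authors' released code, and the LoRA adaptation, Memory Regulation Loss, and Structured Narrative Prompting are omitted.

```python
# Illustrative sketch only: names and the eviction rule are assumptions, not the paper's code.
import torch


class LongTermSparseKVCache:
    """Keeps only the most-attended key/value pairs as a compact global context."""

    def __init__(self, max_entries: int = 128):
        self.max_entries = max_entries
        self.keys = self.values = self.scores = None

    def update(self, k, v, attn_mass):
        # attn_mass: how much attention each new token received in the current step.
        if self.keys is None:
            self.keys, self.values, self.scores = k, v, attn_mass
        else:
            self.keys = torch.cat([self.keys, k], dim=1)
            self.values = torch.cat([self.values, v], dim=1)
            self.scores = torch.cat([self.scores, attn_mass], dim=1)
        if self.keys.shape[1] > self.max_entries:            # evict the least-attended tokens
            top = self.scores.topk(self.max_entries, dim=1).indices
            d = self.keys.shape[-1]
            self.keys = self.keys.gather(1, top.unsqueeze(-1).expand(-1, -1, d))
            self.values = self.values.gather(1, top.unsqueeze(-1).expand(-1, -1, d))
            self.scores = self.scores.gather(1, top)


def attend_with_memory(q, k_new, v_new, cache: LongTermSparseKVCache):
    """Short-term tokens attend to the current chunk plus the sparse long-term cache."""
    if cache.keys is not None:
        k = torch.cat([cache.keys, k_new], dim=1)
        v = torch.cat([cache.values, v_new], dim=1)
    else:
        k, v = k_new, v_new
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    out = attn @ v
    # Use the attention mass on the new tokens as their importance score for caching.
    cache.update(k_new, v_new, attn[..., -k_new.shape[1]:].mean(dim=1))
    return out


if __name__ == "__main__":
    cache = LongTermSparseKVCache(max_entries=24)
    for _ in range(4):                          # pretend the video is generated in 4 chunks
        q = k = v = torch.randn(1, 16, 64)      # 16 tokens per chunk, feature dim 64
        out = attend_with_memory(q, k, v, cache)
    print(out.shape, cache.keys.shape)          # torch.Size([1, 16, 64]) torch.Size([1, 24, 64])
```

The point the sketch illustrates is that the long-term store is bounded: older tokens survive only if they keep receiving attention, which is one plausible way to limit the content drift the abstract describes.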
Related papers
- EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding [11.51428438970598]
EgoGraph is a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. We develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning.
arXiv Detail & Related papers (2026-02-27T06:20:58Z)
- VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory [42.2374676860638]
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally. Maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory.
arXiv Detail & Related papers (2025-12-04T07:06:02Z)
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z)
- GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory [59.869552603264076]
We introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory, which structurally models events and their causal and temporal relations into a concise, organized context. Experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline.
arXiv Detail & Related papers (2025-11-15T04:29:00Z)
- Pack and Force Your Memory: Long-form and Consistent Video Generation [26.53691150499802]
Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation.
arXiv Detail & Related papers (2025-10-02T08:22:46Z)
- LongLive: Real-time Interactive Long Video Generation [68.45945318075432]
LongLive is a frame-level autoregressive framework for real-time and interactive long video generation. LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos.
arXiv Detail & Related papers (2025-09-26T17:48:24Z)
- Mixture of Contexts for Long Video Generation [72.96361488755986]
We recast long-context video generation as an internal information retrieval task. We propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content (a minimal illustrative sketch of this kind of top-k context routing appears after this list).
arXiv Detail & Related papers (2025-08-28T17:57:55Z)
- EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z)
- Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos [51.8995932557911]
EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain. We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) achieve remarkably high performance on existing benchmarks using just text or a single frame as input. We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
arXiv Detail & Related papers (2025-03-17T18:50:36Z)
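The Mixture of Contexts entry above describes routing attention to only the most relevant chunks of history. As a minimal illustration of that general idea (the chunking, the mean-pooled chunk scoring, and the top-k value are assumptions for the sketch, not the MoC module itself):

```python
# Illustrative top-k context routing, loosely inspired by the Mixture of Contexts idea above.
import torch


def route_and_attend(q, keys, values, chunk_size=16, top_k=2):
    """Each query attends only to the top_k history chunks whose mean key matches it best."""
    b, n, d = keys.shape
    n_chunks = n // chunk_size
    k_chunks = keys[:, : n_chunks * chunk_size].reshape(b, n_chunks, chunk_size, d)
    v_chunks = values[:, : n_chunks * chunk_size].reshape(b, n_chunks, chunk_size, d)

    # Score each chunk by similarity between the query and the chunk's mean key.
    chunk_summary = k_chunks.mean(dim=2)                          # (b, n_chunks, d)
    scores = torch.einsum("bqd,bcd->bqc", q, chunk_summary)       # (b, q_len, n_chunks)
    top = scores.topk(top_k, dim=-1).indices                      # (b, q_len, top_k)

    outputs = []
    for i in range(q.shape[1]):                                   # route each query separately
        sel = top[:, i]                                           # (b, top_k)
        k_sel = torch.stack([k_chunks[j, sel[j]] for j in range(b)]).reshape(b, top_k * chunk_size, d)
        v_sel = torch.stack([v_chunks[j, sel[j]] for j in range(b)]).reshape(b, top_k * chunk_size, d)
        qi = q[:, i : i + 1]                                      # (b, 1, d)
        attn = torch.softmax(qi @ k_sel.transpose(-2, -1) / d ** 0.5, dim=-1)
        outputs.append(attn @ v_sel)                              # (b, 1, d)
    return torch.cat(outputs, dim=1)                              # (b, q_len, d)


if __name__ == "__main__":
    q = torch.randn(1, 4, 32)       # 4 current queries
    kv = torch.randn(1, 128, 32)    # 128 tokens of history
    print(route_and_attend(q, kv, kv).shape)   # torch.Size([1, 4, 32])
```

Because each query touches only top_k * chunk_size history tokens instead of all of them, compute concentrates on the salient parts of the history, which is the property the MoC summary emphasizes.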