EgoLCD: Egocentric Video Generation with Long Context Diffusion
- URL: http://arxiv.org/abs/2512.04515v1
- Date: Thu, 04 Dec 2025 06:53:01 GMT
- Title: EgoLCD: Egocentric Video Generation with Long Context Diffusion
- Authors: Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang,
- Abstract summary: EgoLCD is an end-to-end framework for egocentric long-context video generation. It combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory. EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency.
- Score: 11.039806330368153
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
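The abstract sketches a two-tier memory: a Long-Term Sparse KV Cache for global context plus an attention-based short-term memory over the current chunk. Below is a minimal, hedged sketch of how such a scheme could be wired into chunk-wise autoregressive generation; all class and function names, the attention-mass eviction rule, and the tensor shapes are illustrative assumptions rather than the authors' released code, and the LoRA adaptation, Memory Regulation Loss, and Structured Narrative Prompting are omitted.

```python
# Illustrative sketch only: names and the eviction rule are assumptions, not the paper's code.
import torch


class LongTermSparseKVCache:
    """Keeps only the most-attended key/value pairs as a compact global context."""

    def __init__(self, max_entries: int = 128):
        self.max_entries = max_entries
        self.keys = self.values = self.scores = None

    def update(self, k, v, attn_mass):
        # attn_mass: how much attention each new token received in the current step.
        if self.keys is None:
            self.keys, self.values, self.scores = k, v, attn_mass
        else:
            self.keys = torch.cat([self.keys, k], dim=1)
            self.values = torch.cat([self.values, v], dim=1)
            self.scores = torch.cat([self.scores, attn_mass], dim=1)
        if self.keys.shape[1] > self.max_entries:            # evict the least-attended tokens
            top = self.scores.topk(self.max_entries, dim=1).indices
            d = self.keys.shape[-1]
            self.keys = self.keys.gather(1, top.unsqueeze(-1).expand(-1, -1, d))
            self.values = self.values.gather(1, top.unsqueeze(-1).expand(-1, -1, d))
            self.scores = self.scores.gather(1, top)


def attend_with_memory(q, k_new, v_new, cache: LongTermSparseKVCache):
    """Short-term tokens attend to the current chunk plus the sparse long-term cache."""
    if cache.keys is not None:
        k = torch.cat([cache.keys, k_new], dim=1)
        v = torch.cat([cache.values, v_new], dim=1)
    else:
        k, v = k_new, v_new
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    out = attn @ v
    # Use the attention mass on the new tokens as their importance score for caching.
    cache.update(k_new, v_new, attn[..., -k_new.shape[1]:].mean(dim=1))
    return out


if __name__ == "__main__":
    cache = LongTermSparseKVCache(max_entries=24)
    for _ in range(4):                          # pretend the video is generated in 4 chunks
        q = k = v = torch.randn(1, 16, 64)      # 16 tokens per chunk, feature dim 64
        out = attend_with_memory(q, k, v, cache)
    print(out.shape, cache.keys.shape)          # torch.Size([1, 16, 64]) torch.Size([1, 24, 64])
```

The point the sketch illustrates is that the long-term store is bounded: older tokens survive only if they keep receiving attention, which is one plausible way to limit the content drift the abstract describes.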
Related papers
- EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding [11.51428438970598]
EgoGraph is a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. We develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning.
arXiv Detail & Related papers (2026-02-27T06:20:58Z)
- VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory [42.2374676860638]
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally. Maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory.
arXiv Detail & Related papers (2025-12-04T07:06:02Z)
- RELIC: Interactive Video World Model with Long-Horizon Memory [74.81433479334821]
A truly interactive world model requires real-time long-horizon streaming, consistent spatial memory, and precise user control. We present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time.
arXiv Detail & Related papers (2025-12-03T18:29:20Z)
- GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory [59.869552603264076]
We introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory, which structurally models events and their causal and temporal relations into a concise, organized context. Experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline.
arXiv Detail & Related papers (2025-11-15T04:29:00Z)
- Pack and Force Your Memory: Long-form and Consistent Video Generation [26.53691150499802]
Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation.
arXiv Detail & Related papers (2025-10-02T08:22:46Z)
- LongLive: Real-time Interactive Long Video Generation [68.45945318075432]
LongLive is a frame-level autoregressive framework for real-time and interactive long video generation. LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos.
arXiv Detail & Related papers (2025-09-26T17:48:24Z)
- Mixture of Contexts for Long Video Generation [72.96361488755986]
We recast long-context video generation as an internal information retrieval task. We propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content (a minimal illustrative sketch of this kind of top-k context routing appears after this list).
arXiv Detail & Related papers (2025-08-28T17:57:55Z)
- EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z)
- Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos [51.8995932557911]
EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain. We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) achieve remarkably high performance on existing benchmarks using just text or a single frame as input. We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
arXiv Detail & Related papers (2025-03-17T18:50:36Z)
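The Mixture of Contexts entry above describes routing attention to only the most relevant chunks of history. As a minimal illustration of that general idea (the chunking, the mean-pooled chunk scoring, and the top-k value are assumptions for the sketch, not the MoC module itself):

```python
# Illustrative top-k context routing, loosely inspired by the Mixture of Contexts idea above.
import torch


def route_and_attend(q, keys, values, chunk_size=16, top_k=2):
    """Each query attends only to the top_k history chunks whose mean key matches it best."""
    b, n, d = keys.shape
    n_chunks = n // chunk_size
    k_chunks = keys[:, : n_chunks * chunk_size].reshape(b, n_chunks, chunk_size, d)
    v_chunks = values[:, : n_chunks * chunk_size].reshape(b, n_chunks, chunk_size, d)

    # Score each chunk by similarity between the query and the chunk's mean key.
    chunk_summary = k_chunks.mean(dim=2)                          # (b, n_chunks, d)
    scores = torch.einsum("bqd,bcd->bqc", q, chunk_summary)       # (b, q_len, n_chunks)
    top = scores.topk(top_k, dim=-1).indices                      # (b, q_len, top_k)

    outputs = []
    for i in range(q.shape[1]):                                   # route each query separately
        sel = top[:, i]                                           # (b, top_k)
        k_sel = torch.stack([k_chunks[j, sel[j]] for j in range(b)]).reshape(b, top_k * chunk_size, d)
        v_sel = torch.stack([v_chunks[j, sel[j]] for j in range(b)]).reshape(b, top_k * chunk_size, d)
        qi = q[:, i : i + 1]                                      # (b, 1, d)
        attn = torch.softmax(qi @ k_sel.transpose(-2, -1) / d ** 0.5, dim=-1)
        outputs.append(attn @ v_sel)                              # (b, 1, d)
    return torch.cat(outputs, dim=1)                              # (b, q_len, d)


if __name__ == "__main__":
    q = torch.randn(1, 4, 32)       # 4 current queries
    kv = torch.randn(1, 128, 32)    # 128 tokens of history
    print(route_and_attend(q, kv, kv).shape)   # torch.Size([1, 4, 32])
```

Because each query touches only top_k * chunk_size history tokens instead of all of them, compute concentrates on the salient parts of the history, which is the property the MoC summary emphasizes.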