EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
- URL: http://arxiv.org/abs/2602.23709v1
- Date: Fri, 27 Feb 2026 06:20:58 GMT
- Title: EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
- Authors: Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, Jifei Song,
- Abstract summary: EgoGraph is a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. We develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning.
- Score: 11.51428438970598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.
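The abstract describes EgoGraph's schema and temporal modeling only at a high level: entity nodes for people, objects, locations, and events, each with attributes, linked by time-stamped relations that accumulate over multiple days. As a rough illustration of that kind of structure, the following is a minimal sketch; every class, field, and relation name is chosen for illustration and is not taken from the paper.

```python
# Minimal sketch of a temporal knowledge graph in the spirit of the schema
# described in the EgoGraph abstract. All names here are illustrative
# assumptions, not the authors' actual implementation or API.
from dataclasses import dataclass, field
from typing import Dict, List

# Core entity types named in the abstract: people, objects, locations, events.
ENTITY_TYPES = {"person", "object", "location", "event"}


@dataclass
class Entity:
    """A node in the graph: one person/object/location/event with attributes."""
    entity_id: str
    entity_type: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class TemporalEdge:
    """A relation between two entities, stamped with the day and time it was observed."""
    source: str
    target: str
    relation: str          # e.g. "holds", "located_in", "participates_in" (hypothetical)
    day: int               # recording day, so multi-day dependencies stay ordered
    timestamp: float       # seconds into that day's stream


class TemporalKnowledgeGraph:
    """Accumulates entities and temporal relations over multiple days of video."""

    def __init__(self) -> None:
        self.entities: Dict[str, Entity] = {}
        self.edges: List[TemporalEdge] = []

    def add_observation(self, entity: Entity) -> None:
        # Merge repeated sightings of the same entity instead of duplicating
        # nodes, so long-term memory stays stable across days.
        if entity.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity.entity_type}")
        existing = self.entities.get(entity.entity_id)
        if existing is None:
            self.entities[entity.entity_id] = entity
        else:
            existing.attributes.update(entity.attributes)

    def add_relation(self, source: str, target: str, relation: str,
                     day: int, timestamp: float) -> None:
        self.edges.append(TemporalEdge(source, target, relation, day, timestamp))

    def history(self, entity_id: str) -> List[TemporalEdge]:
        """All relations touching an entity, ordered in time, for temporal reasoning."""
        involved = [e for e in self.edges if entity_id in (e.source, e.target)]
        return sorted(involved, key=lambda e: (e.day, e.timestamp))


# Toy usage: two days of observations about the same mug.
graph = TemporalKnowledgeGraph()
graph.add_observation(Entity("mug_1", "object", {"color": "blue"}))
graph.add_observation(Entity("kitchen", "location"))
graph.add_relation("mug_1", "kitchen", "located_in", day=1, timestamp=120.0)
graph.add_relation("mug_1", "kitchen", "located_in", day=2, timestamp=95.5)
print([e.day for e in graph.history("mug_1")])  # -> [1, 2]
```

The one design choice worth noting in this sketch is that repeated sightings of an entity merge into a single node while each relation keeps its day and timestamp, so a cross-day query reduces to sorting an entity's edge history rather than re-scanning the video.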
Related papers
- Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention [58.05340906967343]
Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. Existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets. We introduce Causal-REferring (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS models to the egocentric domain.
arXiv Detail & Related papers (2025-12-30T16:22:14Z) - EgoLCD: Egocentric Video Generation with Long Context Diffusion [11.039806330368153]
EgoLCD is an end-to-end framework for egocentric long-context video generation. It combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory. EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency.
arXiv Detail & Related papers (2025-12-04T06:53:01Z) - EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT [56.24624833924252]
EgoThinker is a framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks.
arXiv Detail & Related papers (2025-10-27T17:38:17Z) - Infinite Video Understanding [50.78256932424239]
We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia research community. We outline the core challenges and key research directions towards achieving this transformative capability.
arXiv Detail & Related papers (2025-07-11T23:07:04Z) - Keystep Recognition using Graph Neural Networks [11.421362760480527]
We propose a flexible graph-learning framework for keystep recognition in egocentric videos. The constructed graphs are sparse and computationally efficient, outperforming existing larger models substantially. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods.
arXiv Detail & Related papers (2025-06-01T17:54:58Z) - Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos [51.8995932557911]
EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain. We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) achieve remarkably high performance on existing benchmarks using just text or a single frame as input. We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
arXiv Detail & Related papers (2025-03-17T18:50:36Z) - Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z) - Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? [48.702973928321946]
Egocentric video-language pretraining is a crucial step in advancing the understanding of hand-object interactions in first-person scenarios. Despite successes on existing testbeds, we find that current EgoVLMs can be easily misled by simple modifications. This raises the question: Do EgoVLMs truly understand hand-object interactions?
arXiv Detail & Related papers (2024-05-28T00:27:29Z) - Action Scene Graphs for Long-Form Understanding of Egocentric Videos [23.058999979457546]
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos.
EASGs provide a temporally evolving graph-based description of the actions performed by the camera wearer.
We will release the dataset and the code to replicate experiments and annotations.
arXiv Detail & Related papers (2023-12-06T10:01:43Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Temporal Relational Modeling with Self-Supervision for Action Segmentation [38.62057004624234]
We introduce Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in video.
In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
arXiv Detail & Related papers (2020-12-14T13:41:28Z)