Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning
- URL: http://arxiv.org/abs/2512.04219v1
- Date: Wed, 03 Dec 2025 19:41:06 GMT
- Title: Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning
- Authors: Zhou Chen, Joe Lin, Sathyanarayanan N. Aakur
- Abstract summary: We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. We show that PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment and structural consistency.
- Score: 9.874456616326274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humans naturally perceive continuous experience as a hierarchy of temporally nested events, fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated across three benchmarks, Breakfast Actions, 50 Salads, and Assembly 101, PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
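The abstract's core mechanism, event boundaries emerging as transient peaks in prediction error, can be illustrated with a minimal streaming sketch. This is not the authors' code: the scalar feature stream, the exponential-smoothing predictor, and the mean-plus-k-sigma peak test are all simplifying assumptions standing in for PARSE's recurrent predictor hierarchy, and a full model would stack one such detector per temporal granularity.

```python
# Illustrative sketch (not the PARSE implementation): flag an event
# boundary whenever the one-step prediction error spikes well above
# its running statistics. Predictor and threshold rule are assumptions.

def detect_boundaries(stream, alpha=0.9, k=2.0):
    """Return time indices where prediction error exceeds
    running mean + k * running std (Welford-style online stats)."""
    pred = None              # one-step prediction (exponential smoothing)
    mean_e, var_e, n = 0.0, 0.0, 0
    boundaries = []
    for t, x in enumerate(stream):
        if pred is not None:
            err = abs(x - pred)
            n += 1
            delta = err - mean_e
            mean_e += delta / n
            var_e += delta * (err - mean_e)
            std_e = (var_e / n) ** 0.5 if n > 1 else 0.0
            # transient peak in prediction error => candidate boundary
            if n > 2 and err > mean_e + k * std_e:
                boundaries.append(t)
        pred = x if pred is None else alpha * x + (1 - alpha) * pred
    return boundaries

# A stream with two regimes: the jump at t=10 registers as a boundary.
stream = [0.0] * 10 + [5.0] * 10
print(detect_boundaries(stream))  # -> [10]
```

In a hierarchical version, the segments delimited by these fine-grained boundaries would be summarized and fed to a slower predictor at the next level, whose own error peaks would yield the coarser events of the partonomy.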
Related papers
- Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models [77.98801218316505]
Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. We investigate the internal processing of LLMs during in-context concept inference.
arXiv Detail & Related papers (2026-02-08T03:14:39Z) - Structured Episodic Event Memory [37.643537420763344]
We propose Structured Episodic Event Memory (SEEM), a hierarchical framework that synergizes a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Experimental results on the LoCoMo and LongMemEval benchmarks demonstrate that SEEM significantly outperforms baselines.
arXiv Detail & Related papers (2026-01-10T03:17:25Z) - Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z) - Triplet Attention Transformer for Spatiotemporal Predictive Learning [9.059462850026216]
We propose an innovative triplet attention transformer designed to capture both inter-frame dynamics and intra-frame static features.
The model incorporates the Triplet Attention Module (TAM), which replaces traditional recurrent units by exploring self-attention mechanisms in temporal, spatial, and channel dimensions.
arXiv Detail & Related papers (2023-10-28T12:49:33Z) - Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z) - Graph-based Time Series Clustering for End-to-End Hierarchical Forecasting [18.069747511100132]
Relationships among time series can be exploited as inductive biases in learning effective forecasting models.
We propose a graph-based methodology to unify relational and hierarchical inductive biases.
arXiv Detail & Related papers (2023-05-30T16:27:25Z) - Long-horizon video prediction using a dynamic latent hierarchy [1.2891210250935146]
We introduce Dynamic Latent Hierarchy (DLH) -- a latent model that represents videos as a hierarchy of latent states.
DLH learns to disentangle representations across its hierarchy.
We demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction.
arXiv Detail & Related papers (2022-12-29T17:19:28Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - An Empirical Study: Extensive Deep Temporal Point Process [61.14164208094238]
We first review recent research emphases and difficulties in modeling asynchronous event sequences with deep temporal point processes. We propose a Granger causality discovery framework for exploiting the relations among multiple types of events.
arXiv Detail & Related papers (2021-10-19T10:15:00Z) - Joint Constrained Learning for Event-Event Relation Extraction [94.3499255880101]
We propose a joint constrained learning framework for modeling event-event relations.
Specifically, the framework enforces logical constraints within and across multiple temporal and subevent relations.
We show that our joint constrained learning approach effectively compensates for the lack of jointly labeled data.
arXiv Detail & Related papers (2020-10-13T22:45:28Z) - Learning to Abstract and Predict Human Actions [60.85905430007731]
We model the hierarchical structure of human activities in videos and demonstrate the power of such structure in action prediction.
We propose Hierarchical-Refresher-Anticipator, a multi-level neural machine that can learn the structure of human activities by observing a partial hierarchy of events and roll-out such structure into a future prediction in multiple levels of abstraction.
arXiv Detail & Related papers (2020-08-20T23:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.