Related papers: Towards Open-World Human Action Segmentation Using Graph Convolutional Networks

Towards Open-World Human Action Segmentation Using Graph Convolutional Networks

URL: http://arxiv.org/abs/2507.00756v1
Date: Tue, 01 Jul 2025 14:00:39 GMT
Title: Towards Open-World Human Action Segmentation Using Graph Convolutional Networks
Authors: Hao Xing, Kai Zhe Boey, Gordon Cheng,
Abstract summary: Most existing learning-based methods excel in closed-world action segmentation.<n>We propose a structured framework for detecting and segmenting unseen actions.<n>We evaluate our framework on two challenging human-object recognition datasets.
Score: 6.167678490008973
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. Most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling. 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations. 3) A novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O) datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.

Related papers

NeurNCD: Novel Class Discovery via Implicit Neural Representation [4.498082064000176]
NeurNCD is a versatile and data-efficient framework for novel class discovery.<n>Our framework achieves superior segmentation performance in both open and closed-world settings.<n>Our method significantly outperforms state-of-the-art approaches on the NYUv2 and Replica datasets.
arXiv Detail & Related papers (2025-06-06T16:43:34Z)
One-shot In-context Part Segmentation [97.77292483684877]
We present the One-shot In-context Part (OIParts) framework to tackle the challenges of part segmentation.<n>Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient.<n>We have achieved remarkable segmentation performance across diverse object categories.
arXiv Detail & Related papers (2025-03-03T03:50:54Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP) ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation [49.56131393810713]
We present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs.
arXiv Detail & Related papers (2023-06-08T22:55:32Z)
Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
Leveraging triplet loss for unsupervised action segmentation [0.0]
We propose a fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches.
arXiv Detail & Related papers (2023-04-13T11:10:16Z)
Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition. Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark [43.00059447663327]
3D skeleton-based human action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches.<n>Despite remarkable progress, current research remains fragmented across diverse input representations.<n>ANUBIS is a large-scale, challenging skeleton action dataset designed to address critical gaps in existing benchmarks.
arXiv Detail & Related papers (2022-05-04T14:03:43Z)
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. We propose system, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision. We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z)
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
temporal action segmentation is a task to classify each frame in the video with an action label. In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.