Related papers: Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

URL: http://arxiv.org/abs/2507.00752v1
Date: Tue, 01 Jul 2025 13:55:57 GMT
Title: Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation
Authors: Hao Xing, Kai Zhe Boey, Yuankai Wu, Darius Burschka, Gordon Cheng,
Abstract summary: temporal segmentation of human actions is critical for intelligent robots in collaborative settings.<n>We propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data.<n>Our approach outperforms state-of-the-art methods, especially in action segmentation accuracy.
Score: 10.122882293302787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation, Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.

Related papers

Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation [83.90109373769614]
3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation.<n>We propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via the binary-to-decimal mapping.<n>We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability.
arXiv Detail & Related papers (2025-11-30T15:51:30Z)
Puppeteer: Rig and Animate Your 3D Models [105.11046762553121]
Puppeteer is a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects.<n>Our system first predicts plausible skeletal structures via an auto-regressive transformer.<n>It then infers skinning weights via an attention-based architecture.
arXiv Detail & Related papers (2025-08-14T17:59:31Z)
Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction. The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph. We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
arXiv Detail & Related papers (2024-10-10T13:39:17Z)
S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR) Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z)
Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features. Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity. We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion. In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
arXiv Detail & Related papers (2022-07-15T10:00:43Z)
LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence. Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence. We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular-temporal graph network for skeleton-based action classification. We develop a dual-head graph network consisting of two inter-leaved branches, which enables us to extract at least two-temporal resolutions. We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body. It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
Group-Skeleton-Based Human Action Recognition in Complex Events [15.649778891665468]
We propose a novel group-skeleton-based human action recognition method in complex events. This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons. Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.
arXiv Detail & Related papers (2020-11-26T13:19:14Z)
Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.