BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
- URL: http://arxiv.org/abs/2601.00369v1
- Date: Thu, 01 Jan 2026 15:13:11 GMT
- Title: BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
- Authors: Seungyeon Cho, Tae-kyun Kim
- Abstract summary: This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation.
- Score: 12.973657570368317
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
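The Noisy-OR rule named in component (2) is a standard probabilistic construct: each stream is treated as an independent potential cause of a class, so the fused probability is one minus the product of the streams' complements. A minimal sketch of this general rule, not the authors' code; the function and tensor names are illustrative:

```python
import torch

def noisy_or_fusion(p_body: torch.Tensor, p_hand: torch.Tensor) -> torch.Tensor:
    """Noisy-OR fusion of two streams' per-class probabilities.

    Each stream acts as an independent 'cause' of a class, so the fused
    probability p = 1 - (1 - p_body) * (1 - p_hand) is high whenever
    either expert is confident. Shapes: (batch, num_classes), values in [0, 1].
    """
    return 1.0 - (1.0 - p_body) * (1.0 - p_hand)

# Illustrative usage with sigmoid-activated per-class scores.
logits_body = torch.randn(8, 60)  # e.g., 60 classes as in NTU RGB+D 60
logits_hand = torch.randn(8, 60)
p_fused = noisy_or_fusion(logits_body.sigmoid(), logits_hand.sigmoid())
```

Because the product form is smooth, the gradient reaching one stream is weighted by the other stream's residual uncertainty, which suggests how such a fusion can be trained without explicit confidence supervision.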
Related papers
- UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling [45.29560152294065]
UniHand is a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. Visual observations are encoded with a frozen backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features. A latent diffusion model then synthesizes consistent motion sequences from diverse conditions.
arXiv Detail & Related papers (2026-02-25T06:53:15Z)
- Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization [0.06596280437011041]
We present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. Our implementation uses three modalities: speech, text, and facial imagery. We introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence.
arXiv Detail & Related papers (2026-02-09T19:12:30Z)
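Dirichlet-evidence fusion of this kind typically follows the subjective-logic recipe: non-negative per-class evidence parameterizes a Dirichlet distribution, each modality yields a (belief, uncertainty) opinion, and opinions are merged with a reduced Dempster-Shafer combination rule. A hedged sketch of that generic recipe, not necessarily this paper's exact implementation:

```python
import torch

def opinion_from_evidence(evidence: torch.Tensor):
    """Map non-negative evidence (batch, K) to belief masses and uncertainty
    via the Dirichlet parameterization alpha = evidence + 1."""
    K = evidence.shape[1]
    S = (evidence + 1.0).sum(dim=1, keepdim=True)  # Dirichlet strength
    return evidence / S, K / S                     # beliefs b, uncertainty u

def ds_combine(b1, u1, b2, u2):
    """Reduced Dempster-Shafer combination of two opinions (b, u),
    each satisfying sum(b) + u = 1 per sample."""
    agree = (b1 * b2).sum(dim=1, keepdim=True)
    # Conflict: belief mass the two opinions place on different classes.
    conflict = b1.sum(dim=1, keepdim=True) * b2.sum(dim=1, keepdim=True) - agree
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    return b, scale * u1 * u2

# Illustrative usage: evidence could come from, e.g., softplus of logits.
e_speech, e_text = torch.rand(4, 7), torch.rand(4, 7)  # 7 emotion classes
(b1, u1), (b2, u2) = opinion_from_evidence(e_speech), opinion_from_evidence(e_text)
b_fused, u_fused = ds_combine(b1, u1, b2, u2)
```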
arXiv Detail & Related papers (2026-02-09T19:12:30Z) - Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition [0.5964436882344729]
We propose PAN, the first human-centric graph representation learning framework for multimodal action recognition. We present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network.
arXiv Detail & Related papers (2025-12-26T08:17:10Z)
- Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework [0.0]
This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in large language model systems. The architecture integrates three heterogeneous LLMs used for semantic generation, analytical consistency checking, and transparency auditing.
arXiv Detail & Related papers (2025-12-17T16:42:45Z)
- Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach [8.505513517407993]
We propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation.
arXiv Detail & Related papers (2025-07-03T17:37:46Z)
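Confidence-driven gradient modulation is commonly implemented as an identity mapping in the forward pass whose gradient is rescaled in the backward pass. A minimal sketch under that assumption; the damping rule `1 - confidence` is illustrative and not necessarily this paper's exact schedule:

```python
import torch

class GradModulate(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None

def confidence_modulated(features: torch.Tensor, confidence: float) -> torch.Tensor:
    # Damp the branch whose predictions are already confident, so the
    # lagging modality receives relatively stronger updates (assumed rule).
    return GradModulate.apply(features, 1.0 - confidence)
```

Here `confidence` could be, for example, the mean maximum softmax probability of the branch's own classifier head over a batch.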
- Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
- Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation [7.797154022794006]
Recent endeavors regard the RGB modality as the center and the others as auxiliary, yielding an asymmetric architecture with two branches.
We propose a novel method, named MAGIC, that can be flexibly paired with various backbones, ranging from compact to high-performance models.
Our method achieves state-of-the-art performance while reducing the model parameters by 60%.
arXiv Detail & Related papers (2024-07-16T03:19:59Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
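A Brownian bridge is pinned at both endpoints, so a Brownian-bridge regularizer can be read as constraining an intermediate representation to stay near the time-interpolated mean of two modality embeddings, with slack t(1 - t) that grows toward the midpoint. One plausible sketch of such a loss; the paper's exact formulation may differ:

```python
import torch

def brownian_bridge_loss(z0: torch.Tensor, z1: torch.Tensor,
                         z_t: torch.Tensor, t: float) -> torch.Tensor:
    """Penalize an intermediate embedding z_t for straying from the
    Brownian bridge between endpoint embeddings z0 and z1.

    z0, z1, z_t: (batch, d); t in (0, 1) is the relative position.
    The bridge mean is (1 - t) * z0 + t * z1, and its variance t * (1 - t)
    softens the penalty near the midpoint.
    """
    mean = (1.0 - t) * z0 + t * z1
    var = t * (1.0 - t)
    return ((z_t - mean) ** 2).sum(dim=1).div(var).mean()
```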
- SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by several factors.
Inspired by recent attention mechanisms, we propose a multi-grain contextual focus module, termed MCF, to capture action-associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
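The joint and bone streams used here match the standard skeleton modality derivations behind the main paper's four-stream ensemble: bones are joint-to-parent offsets, and the motion streams are frame-to-frame differences. A sketch under the assumption of a dataset-specific parent table:

```python
import numpy as np

def joints_to_bones(joints: np.ndarray, parents: list) -> np.ndarray:
    """Derive the bone modality as vectors from each joint to its parent.

    joints: (frames, num_joints, 3) coordinates; parents: parent index per
    joint (the skeleton topology is dataset-specific, e.g., 25 joints in
    NTU RGB+D). A root joint listed as its own parent yields a zero bone.
    """
    return joints - joints[:, parents, :]

def temporal_motion(x: np.ndarray) -> np.ndarray:
    """Frame-to-frame differences, giving the joint/bone motion modalities."""
    motion = np.zeros_like(x)
    motion[:-1] = x[1:] - x[:-1]
    return motion

# Four skeleton modalities: joint = raw coordinates,
# bone = joints_to_bones(joint, parents),
# joint_motion = temporal_motion(joint), bone_motion = temporal_motion(bone).
```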
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations. The model takes two bimodal pairs as input, motivated by the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
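The vanilla central difference convolution underlying the 3D-CDC family augments a plain convolution with a central-difference term, y = theta * sum_n w_n * (x(p0 + p_n) - x(p0)) + (1 - theta) * sum_n w_n * x(p0 + p_n), which algebraically collapses to the standard convolution minus a cheap pointwise correction. A sketch of that basic 3D form; the paper's 3D-CDC family includes further spatio-temporal variants:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    """Vanilla 3D central difference convolution.

    Computes conv3d(x) minus theta times the pointwise term x_center * sum(W);
    theta = 0 recovers a plain Conv3d, theta = 1 uses pure central differences.
    """

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0.0:
            return out
        # Summing kernel weights over the spatio-temporal window gives a
        # 1x1x1 kernel, so the difference term is a pointwise convolution.
        w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        return out - self.theta * F.conv3d(x, w_sum)
```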