OpenHuman4D: Open-Vocabulary 4D Human Parsing
- URL: http://arxiv.org/abs/2507.09880v2
- Date: Sat, 26 Jul 2025 02:39:46 GMT
- Title: OpenHuman4D: Open-Vocabulary 4D Human Parsing
- Authors: Keito Suzuki, Bang Du, Runfa Blark Li, Kunyao Chen, Lei Wang, Peng Liu, Ning Bi, Truong Nguyen
- Abstract summary: We introduce the first 4D human parsing framework that reduces inference time and introduces open-vocabulary capabilities. Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends support to 4D human-centric video.
- Score: 7.533936292165496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding dynamic 3D human representation has become increasingly critical in virtual and extended reality applications. However, existing human part segmentation methods are constrained by reliance on closed-set datasets and prolonged inference times, which significantly restrict their applicability. In this paper, we introduce the first 4D human parsing framework that simultaneously addresses these challenges by reducing the inference time and introducing open-vocabulary capabilities. Building upon state-of-the-art open-vocabulary 3D human parsing techniques, our approach extends support to 4D human-centric video with three key innovations: 1) We adopt mask-based video object tracking to efficiently establish spatial and temporal correspondences, avoiding the need to segment every frame. 2) A novel Mask Validation module is designed to manage new target identification and mitigate tracking failures. 3) We propose a 4D Mask Fusion module, integrating memory-conditioned attention and logits equalization for robust embedding fusion. Extensive experiments demonstrate the effectiveness and flexibility of the proposed method on 4D human-centric parsing tasks, achieving up to 93.3% acceleration compared to the previous state-of-the-art method, which was limited to parsing fixed classes.
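For intuition, below is a minimal sketch of the kind of fusion step the 4D Mask Fusion module describes, assuming CLIP-style text prompt embeddings and a small per-track memory bank of past-frame mask embeddings. All function names, tensor shapes, the mixing weights, and the exact equalization rule are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fuse_mask_logits(frame_emb, memory_embs, text_embs, temperature=0.07):
    """Fuse one tracked mask's current embedding with its per-track memory,
    then score it against open-vocabulary text prompts.

    frame_emb:   (D,)   embedding of the mask in the current frame
    memory_embs: (M, D) embeddings of the same mask in past frames
    text_embs:   (C, D) text prompt embeddings for C candidate part labels
    returns:     (C,)   per-class logits
    """
    d = frame_emb.shape[0]
    # Memory-conditioned attention: the current embedding attends over the
    # memory bank and mixes in the attended summary (0.5/0.5 is arbitrary).
    attn = F.softmax(frame_emb @ memory_embs.T / d ** 0.5, dim=-1)   # (M,)
    fused = 0.5 * frame_emb + 0.5 * (attn @ memory_embs)             # (D,)

    # CLIP-style cosine-similarity logits against the prompts.
    fused = F.normalize(fused, dim=-1)
    prompts = F.normalize(text_embs, dim=-1)
    logits = fused @ prompts.T / temperature                         # (C,)

    # "Logits equalization" is assumed here to be a simple zero-centering
    # across classes so that no prompt dominates via a constant bias.
    return logits - logits.mean()

# Toy usage with random tensors.
D, M, C = 512, 8, 6
logits = fuse_mask_logits(torch.randn(D), torch.randn(M, D), torch.randn(C, D))
predicted_class = logits.argmax().item()
```

Consistent with the abstract, one would run such open-vocabulary scoring only on keyframes selected by the Mask Validation step and rely on mask tracking to propagate labels elsewhere.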
Related papers
- Efficient Listener: Dyadic Facial Motion Synthesis via Action Diffusion [91.54433928140816]
We propose Facial Action Diffusion (FAD), which adapts diffusion methods from the field of image generation to achieve efficient facial action generation. We further build the Efficient Listener Network (ELNet), specially designed to accommodate both the visual and audio information of the speaker as input. Combining FAD and ELNet, the proposed method learns effective listener facial motion representations and improves performance over state-of-the-art methods.
arXiv Detail & Related papers (2025-04-29T12:08:02Z)
- Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos [70.07088203106443]
Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked Autoencoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
arXiv Detail & Related papers (2025-04-07T08:47:36Z)
- Open-Vocabulary Semantic Part Segmentation of 3D Human [4.380538063753977]
We present the first open-vocabulary segmentation method capable of handling 3D humans. Our framework can segment the human category into desired fine-grained parts based on a textual prompt. Our method can be directly applied to various 3D representations, including meshes, point clouds, and 3D Gaussian Splatting.
arXiv Detail & Related papers (2025-02-27T05:47:05Z)
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach for holistic 3D pose estimation.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models [40.966197115577344]
The 3D Human Pose Estimation task uses 2D images or videos to predict human joint coordinates in 3D space.
We present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE.
It consists of three core blocks enhancing the reverse process of the diffusion model.
Experiments on public single-human pose datasets show that FinePOSE outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-05-08T17:09:03Z)
- Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models [8.933560282929726]
We introduce a novel affordance representation, named Comprehensive Affordance (ComA).
Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes.
We demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance.
arXiv Detail & Related papers (2024-01-23T18:59:59Z)
- A Unified Approach for Text- and Image-guided 4D Scene Generation [58.658768832653834]
We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis.
We show that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation.
Our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.
arXiv Detail & Related papers (2023-11-28T15:03:53Z)
- Context-Aware Sequence Alignment using 4D Skeletal Augmentation [67.05537307224525]
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
We propose Context-Aware Sequence Alignment (CASA), a novel self-supervised learning architecture to align sequences of actions.
Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions.
arXiv Detail & Related papers (2022-04-26T10:59:29Z)
- Magnifying Subtle Facial Motions for Effective 4D Expression Recognition [56.806738404887824]
The flow of 3D faces is first analyzed to capture the spatial deformations.
The obtained temporal evolution of these deformations is fed into a magnification method.
The latter, the main contribution of this paper, reveals subtle (hidden) deformations that enhance emotion classification performance.
arXiv Detail & Related papers (2021-05-05T20:47:43Z)
- Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry [62.29762409558553]
Epipolar constraints (the standard two-view relation is sketched after this entry) are at the core of feature matching and depth estimation in multi-person 3D human pose estimation methods.
While this formulation performs well in sparser crowd scenes, its effectiveness is frequently challenged in denser crowds.
In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation.
arXiv Detail & Related papers (2020-07-21T17:59:36Z)
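For reference, the epipolar constraint mentioned in the last entry is the standard two-view relation from multi-view geometry (textbook background, not this paper's specific formulation): corresponding homogeneous image points x and x' in two views satisfy

```latex
% Standard two-view epipolar constraint (textbook background).
% x, x' are corresponding homogeneous image points; F is the fundamental
% matrix, built from the relative pose (R, t) and intrinsics K, K'.
\mathbf{x}'^{\top} F \, \mathbf{x} = 0,
\qquad
F = K'^{-\top} \, [\mathbf{t}]_{\times} \, R \, K^{-1}
```

In sparse scenes this constraint reliably prunes cross-view matches; in dense crowds many people lie near the same epipolar lines, which is the failure mode the paper's crowd-pose reformulation targets.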