Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation
- URL: http://arxiv.org/abs/2407.14062v1
- Date: Fri, 19 Jul 2024 06:41:16 GMT
- Title: Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation
- Authors: Zhe Zhao, Mengshi Qi, Huadong Ma
- Abstract summary: We propose a novel Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE) to generate realistic human grasps.
The part-aware decomposed architecture facilitates more precise management of the interaction between each component of the hand and the object.
Our model achieved about 14.1% relative improvement in the quality index compared to the state-of-the-art methods in four widely adopted benchmarks.
- Score: 27.206656215734295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating realistic human grasps is a crucial yet challenging task for applications involving object manipulation in computer graphics and robotics. Existing methods often struggle to generate fine-grained, realistic human grasps in which all fingers effectively interact with the object, because they encode the hand as a whole and then estimate both hand posture and position in a single step. In this paper, we propose a novel Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE) to address this limitation by decomposing the hand into several distinct parts and encoding them separately. This part-aware decomposed architecture facilitates more precise management of the interaction between each component of the hand and the object, enhancing the overall realism of generated human grasps. Furthermore, we design a new dual-stage decoding strategy that first determines the type of grasp under skeletal physical constraints and then identifies the location of the grasp, which can greatly improve the verisimilitude of the results as well as the adaptability of the model to unseen hand-object interactions. In experiments, our model achieved about 14.1% relative improvement in the quality index compared to the state-of-the-art methods in four widely adopted benchmarks. Our source code is available at https://github.com/florasion/D-VQVAE.
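The part-wise encoding described in the abstract can be illustrated with a minimal sketch of vector quantization applied separately to each hand part. This is not the authors' implementation: the part names, the latent dimension, the codebook size, the use of one codebook per part, and the helper names `nearest_code` and `quantize_parts` are all assumptions made for illustration. The only idea taken from the abstract is that each part is encoded separately and mapped to a discrete code.

```python
import random


def nearest_code(vec, codebook):
    """Return (index, entry) of the codebook entry closest to vec (squared L2)."""
    best_idx = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )
    return best_idx, codebook[best_idx]


def quantize_parts(part_latents, codebooks):
    """Quantize each part's latent vector with that part's own codebook."""
    indices, quantized = {}, {}
    for part, latent in part_latents.items():
        idx, code = nearest_code(latent, codebooks[part])
        indices[part] = idx      # discrete token for this part
        quantized[part] = code   # vector a decoder would consume
    return indices, quantized


# Hypothetical setup: six parts, 4-dimensional latents, 8 codes per part.
PARTS = ["thumb", "index", "middle", "ring", "pinky", "palm"]
DIM, NUM_CODES = 4, 8
rng = random.Random(0)

codebooks = {
    p: [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_CODES)]
    for p in PARTS
}
# Stand-ins for encoder outputs; a real model would compute these from the hand.
part_latents = {p: [rng.uniform(-1, 1) for _ in range(DIM)] for p in PARTS}

indices, quantized = quantize_parts(part_latents, codebooks)
for part in PARTS:
    print(part, indices[part])
```

In a trained model the codebooks would be learned jointly with the encoder and decoder; here they are random so that the lookup step can be run in isolation.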
Related papers
- HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation [15.606904161622017]
This paper proposes the Denoising Adaptive Graph Transformer, HandDAGT, for hand pose estimation.
It incorporates a novel attention mechanism to adaptively weigh the contribution of kinematic correspondence and local geometric features for the estimation of specific keypoints.
Experimental results show that the proposed model significantly outperforms the existing methods on four challenging hand pose benchmark datasets.
arXiv Detail & Related papers (2024-07-30T04:53:35Z)
- GEARS: Local Geometry-aware Hand-object Interaction Synthesis [38.75942505771009]
We introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions.
As an important step towards mitigating the learning complexity, we transform the points from global frame to template hand frame and use a shared module to process sensor features of each individual joint.
This is followed by a perceptual-temporal transformer network aimed at capturing correlation among the joints in different dimensions.
arXiv Detail & Related papers (2024-04-02T09:18:52Z)
- Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors [17.3834029178939]
This paper introduces a novel human pose estimation approach using sparse inertial sensors.
It leverages a diverse array of real inertial motion capture data from different skeleton formats to improve motion diversity and model generalization.
The approach demonstrates superior performance over state-of-the-art models across five public datasets, notably reducing pose error by 19% on the DIP-IMU dataset.
arXiv Detail & Related papers (2023-12-02T13:17:10Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Fast and Expressive Gesture Recognition using a Combination-Homomorphic Electromyogram Encoder [21.25126610043744]
We study the task of gesture recognition from electromyography (EMG).
We define combination gestures consisting of a direction component and a modifier component.
New subjects only demonstrate the single component gestures.
We extrapolate to unseen combination gestures by combining the feature vectors of real single gestures to produce synthetic training data.
arXiv Detail & Related papers (2023-10-30T20:03:34Z)
- A Multi-label Classification Approach to Increase Expressivity of EMG-based Gesture Recognition [4.701158597171363]
The aim of this study is to efficiently increase the expressivity of surface electromyography-based (sEMG) gesture recognition systems.
We use a problem transformation approach, in which actions were subset into two biomechanically independent components.
arXiv Detail & Related papers (2023-09-13T20:21:41Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Interacting Hand-Object Pose Estimation via Dense Mutual Attention [97.26400229871888]
3D hand-object pose estimation is the key to the success of many computer vision applications.
We propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object.
Our method is able to produce physically plausible poses with high quality and real-time inference speed.
arXiv Detail & Related papers (2022-11-16T10:01:33Z)
- TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, and more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
- Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera [79.41374930171469]
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands.
Our approach combines an extensive list of favorable properties, namely it is marker-less.
We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work.
arXiv Detail & Related papers (2021-06-15T11:39:49Z)
- HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [54.23770284299979]
This paper introduces a novel form of supervision: Hierarchical Multi-person Ordinal Relations (HMOR).
HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically.
An integrated top-down model is designed to leverage these ordinal relations in the learning process.
The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets.
arXiv Detail & Related papers (2020-08-01T07:53:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.