FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
- URL: http://arxiv.org/abs/2410.05791v1
- Date: Tue, 8 Oct 2024 08:21:05 GMT
- Title: FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
- Authors: Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, C. Karen Liu
- Abstract summary: Hand motion models with the sophistication to accurately recreate piano playing have a wide range of applications in character animation, embodied AI, biomechanics, and VR/AR.
In this paper, we construct a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music.
- Score: 15.909113091360206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Piano playing requires agile, precise, and coordinated hand control that stretches the limits of dexterity. Hand motion models with the sophistication to accurately recreate piano playing have a wide range of applications in character animation, embodied AI, biomechanics, and VR/AR. In this paper, we construct a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music. To capture natural performances, we designed a markerless setup in which motions are reconstructed from multi-view videos using state-of-the-art pose estimation models. The motion data is further refined via inverse kinematics using the high-resolution MIDI key-pressing data obtained from sensors in a specialized Yamaha Disklavier piano. Leveraging the collected dataset, we develop a pipeline that can synthesize physically plausible hand motions for musical scores outside of the dataset. Our approach employs a combination of imitation learning and reinforcement learning to obtain policies for physics-based bimanual control involving the interaction between hands and piano keys. To address the sampling-efficiency problem posed by the large motion dataset, we use a diffusion model to generate natural reference motions, which provide high-level trajectory and fingering (finger order and placement) information. However, the generated reference motion alone does not provide sufficient accuracy for piano performance modeling. We therefore further augment the data by using musical similarity to retrieve similar motions from the captured dataset, boosting the precision of the RL policy. With the proposed method, our model generates natural, dexterous motions that generalize to music outside the training dataset.
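The abstract describes a pipeline in which a diffusion model proposes reference motions and musically similar segments are retrieved from the captured dataset to sharpen the RL policy. Below is a minimal, hypothetical sketch of the retrieval step only; the data structure, the pitch-histogram feature, and the cosine similarity measure are assumptions for illustration, not the authors' released implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NoteSegment:
    # Hypothetical container for one captured excerpt (names are assumptions).
    pitches: np.ndarray   # MIDI pitch numbers of a short musical excerpt
    motion: np.ndarray    # (T, J, 3) captured hand joint trajectory for that excerpt

def pitch_histogram(pitches, num_bins=128):
    # Bag-of-pitches feature; a stand-in for whatever musical-similarity
    # feature the paper actually uses.
    pitches = np.clip(np.asarray(pitches, dtype=int), 0, num_bins - 1)
    hist = np.bincount(pitches, minlength=num_bins).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def retrieve_similar(query_pitches, dataset, top_k=4):
    # Rank captured segments by cosine similarity of pitch content and return
    # the top-k as additional reference motions for the RL policy.
    q = pitch_histogram(query_pitches)
    scores = np.array([q @ pitch_histogram(seg.pitches) for seg in dataset])
    best = np.argsort(scores)[::-1][:top_k]
    return [dataset[i] for i in best]

# Hypothetical usage: combine with a diffusion-generated reference before RL.
# references = [diffusion_reference] + retrieve_similar(score_pitches, captured_segments)
```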
Related papers
- UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities.
By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z) - RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands [57.64308229980045]
We introduce the Robot Piano 1 Million (RP1M) dataset, containing more than one million trajectories of bi-manual robot piano-playing motion.
We formulate finger placements as an optimal transport problem, enabling automatic annotation of vast amounts of unlabeled songs (a minimal assignment-style sketch appears after this list).
Benchmarking existing imitation learning approaches shows that they reach state-of-the-art robot piano-playing performance by leveraging RP1M.
arXiv Detail & Related papers (2024-08-20T17:56:52Z) - PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance [15.21347897534943]
We construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing.
To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird's-eye view with 10 million annotated hand poses.
arXiv Detail & Related papers (2024-06-13T17:05:23Z) - HMP: Hand Motion Priors for Pose and Shape Estimation from Video [52.39020275278984]
We develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions.
Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios.
We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets.
arXiv Detail & Related papers (2023-12-27T22:35:33Z) - BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis [123.73677487809418]
We introduce a new dataset aiming to challenge common assumptions in dance motion synthesis.
We focus on breakdancing which features acrobatic moves and tangled postures.
Our efforts produced the BRACE dataset, which contains over 3 hours and 30 minutes of densely annotated poses.
arXiv Detail & Related papers (2022-07-20T18:03:54Z) - AIMusicGuru: Music Assisted Human Pose Correction [8.020211030279686]
We present a method that leverages the strong causal relationship between the sound produced and the motion that produces it.
We use the audio signature to refine and predict accurate human body pose motion models.
We also open-source MAPdat, a new multi-modal dataset of 3D violin playing motion with music.
arXiv Detail & Related papers (2022-03-24T03:16:42Z) - Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z) - Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z) - Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural means of imbuing variational autoencoders with context information.
We introduce the first Music Adversarial Autoencoder (MusAE).
Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
arXiv Detail & Related papers (2020-01-15T18:07:20Z)
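The RP1M entry above formulates finger placement as an optimal transport problem. As a loosely related illustration only (not RP1M's code), the balanced one-to-one special case of such an assignment can be solved with the Hungarian algorithm; the finger and key coordinates below are made-up numbers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical resting x-positions (in key widths) of five right-hand fingertips
finger_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # thumb .. pinky (assumed)
# Hypothetical x-positions of the keys to be pressed in one chord
key_x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Cost of assigning finger i to key j: squared horizontal travel distance
cost = (finger_x[:, None] - key_x[None, :]) ** 2

# Minimum-cost one-to-one assignment (a balanced optimal-transport special case)
fingers, keys = linear_sum_assignment(cost)
for f, k in zip(fingers, keys):
    print(f"finger {f} -> key at x = {key_x[k]:.1f}")
```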