AIMusicGuru: Music Assisted Human Pose Correction
- URL: http://arxiv.org/abs/2203.12829v1
- Date: Thu, 24 Mar 2022 03:16:42 GMT
- Title: AIMusicGuru: Music Assisted Human Pose Correction
- Authors: Snehesh Shrestha, Cornelia Fermüller, Tianyu Huang, Pyone Thant Win,
Adam Zukerman, Chethan M. Parameshwara, Yiannis Aloimonos
- Abstract summary: We present a method that leverages the strong causal relationship between the sound produced and the motion that produces it.
We use the audio signature to refine and predict accurate human body pose motion models.
We also open-source MAPdat, a new multi-modal dataset of 3D violin playing motion with music.
- Score: 8.020211030279686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pose estimation techniques rely on visual cues available through observations
represented in the form of pixels, but their performance is bounded by the frame
rate of the video and suffers from motion blur, occlusions, and a lack of temporal
coherence. This issue is magnified when people are interacting with objects and
instruments, for example playing the violin. Standard approaches for
postprocessing use interpolation and smoothing functions to filter noise and
fill gaps, but they cannot model highly non-linear motion. We present a method
that leverages the strong causal relationship between the sound produced and the
motion that produces it. We use the audio signature to refine and predict accurate
human body pose motion models. We propose MAPnet (Music Assisted Pose network) for
generating a fine-grained motion model from sparse input pose sequences combined
with continuous audio. To accelerate
further research in this domain, we also open-source MAPdat, a new multi-modal
dataset of 3D violin playing motion with music. We compare several standard
machine learning models and analyze the effect of input modalities, sampling
techniques, and audio and motion features. Experiments on MAPdat suggest that
multi-modal approaches like ours are a promising direction for tasks previously
approached with visual methods only. Our results show, both qualitatively and
quantitatively, how audio can be combined with visual observation to help improve
any pose estimation method.
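The paper itself does not include code, so the following is only a minimal sketch of the general idea rather than the authors' MAPnet: a hypothetical PyTorch module (the name AudioAssistedPoseUpsampler, all layer sizes, the choice of mel-spectrogram features, and the residual design are assumptions) that takes an interpolated sparse pose sequence together with continuous per-frame audio features and regresses a refined dense pose sequence.

```python
# Hypothetical sketch of audio-assisted pose upsampling (not the authors' MAPnet).
# Assumes: poses are J joints x 3 coordinates; audio features are mel-spectrogram
# frames aligned to the target (dense) frame rate; all layer sizes are illustrative.
import torch
import torch.nn as nn

class AudioAssistedPoseUpsampler(nn.Module):
    def __init__(self, num_joints=17, audio_dim=80, hidden=256):
        super().__init__()
        pose_dim = num_joints * 3
        self.pose_proj = nn.Linear(pose_dim, hidden)     # embed interpolated sparse poses
        self.audio_proj = nn.Linear(audio_dim, hidden)   # embed per-frame audio features
        self.temporal = nn.GRU(2 * hidden, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, pose_dim)      # regress dense per-frame poses

    def forward(self, coarse_pose, audio_feat):
        """
        coarse_pose: (B, T, J*3) low-frame-rate poses linearly upsampled to the
                     dense length T (a crude initial guess)
        audio_feat:  (B, T, audio_dim) continuous audio features at the dense rate
        returns:     (B, T, J*3) refined dense pose sequence
        """
        x = torch.cat([self.pose_proj(coarse_pose),
                       self.audio_proj(audio_feat)], dim=-1)
        h, _ = self.temporal(x)
        # Predict a residual on top of the interpolated guess, so the simple
        # interpolation baseline remains the fallback when audio is uninformative.
        return coarse_pose + self.head(h)

if __name__ == "__main__":
    B, T, J, A = 2, 120, 17, 80
    model = AudioAssistedPoseUpsampler(num_joints=J, audio_dim=A)
    coarse = torch.randn(B, T, J * 3)   # stand-in for interpolated sparse poses
    audio = torch.randn(B, T, A)        # stand-in for mel-spectrogram frames
    print(model(coarse, audio).shape)   # torch.Size([2, 120, 51])
```

The residual formulation is one plausible way to combine the two signals; it mirrors the abstract's contrast between plain interpolation/smoothing and an audio-conditioned motion model, but any resemblance to the actual MAPnet architecture is not implied.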
Related papers
- VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference [7.5565058831496055]
Current state-of-the-art visual pose estimation algorithms struggle to produce accurate monocular 4D poses.
We propose VioPose: a novel multimodal network that hierarchically estimates dynamics.
Our architecture is shown to produce accurate pose sequences, facilitating precise motion analysis, and outperforms SoTA.
arXiv Detail & Related papers (2024-11-19T20:57:15Z) - Tracking Everything Everywhere All at Once [111.00807055441028]
We present a new test-time optimization method for estimating dense and long-range motion from a video sequence.
We propose a complete and globally consistent motion representation, dubbed OmniMotion.
Our approach outperforms prior state-of-the-art methods by a large margin both quantitatively and qualitatively.
arXiv Detail & Related papers (2023-06-08T17:59:29Z) - Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion
Models [22.000197530493445]
We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power.
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
arXiv Detail & Related papers (2022-11-17T17:41:00Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Render In-between: Motion Guided Video Synthesis for Action
Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline requires only low-frame-rate videos and unpaired human motion data for training; no high-frame-rate videos are needed.
arXiv Detail & Related papers (2021-11-01T15:32:51Z) - PIRenderer: Controllable Portrait Image Generation via Semantic Neural
Rendering [56.762094966235566]
A Portrait Image Neural Renderer is proposed to control the face motions with the parameters of three-dimensional morphable face models.
The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications.
Our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream.
arXiv Detail & Related papers (2021-09-17T07:24:16Z) - Audio2Gestures: Generating Diverse Gestures from Speech Audio with
Conditional Variational Autoencoders [29.658535633701035]
We propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping.
We show that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-15T11:15:51Z) - Strumming to the Beat: Audio-Conditioned Contrastive Video Textures [112.6140796961121]
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning.
We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order.
Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
arXiv Detail & Related papers (2021-04-06T17:24:57Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)