Related papers: VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference

URL: http://arxiv.org/abs/2411.13607v2
Date: Mon, 25 Nov 2024 05:14:20 GMT
Title: VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference
Authors: Seong Jong Yoo, Snehesh Shrestha, Irina Muresanu, Cornelia Fermüller
Abstract summary: Current state-of-the-art (SoTA) visual pose estimation algorithms struggle to produce accurate monocular 4D poses. We propose VioPose: a novel multimodal network that hierarchically estimates dynamics. Our architecture is shown to produce accurate pose sequences, facilitating precise motion analysis, and outperforms SoTA.
Score: 7.5565058831496055
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Musicians delicately control their bodies to generate music. Sometimes, their motions are too subtle to be captured by the human eye. To analyze how they move to produce the music, we need to estimate precise 4D human pose (3D pose over time). However, current state-of-the-art (SoTA) visual pose estimation algorithms struggle to produce accurate monocular 4D poses because of occlusions, partial views, and human-object interactions. They are limited by the viewing angle, pixel density, and sampling rate of the cameras and fail to estimate fast and subtle movements, such as in the musical effect of vibrato. We leverage the direct causal relationship between the music produced and the human motions creating them to address these challenges. We propose VioPose: a novel multimodal network that hierarchically estimates dynamics. High-level features are cascaded to low-level features and integrated into Bayesian updates. Our architecture is shown to produce accurate pose sequences, facilitating precise motion analysis, and outperforms SoTA. As part of this work, we collected the largest and the most diverse calibrated violin-playing dataset, including video, sound, and 3D motion capture poses. Code and dataset can be found on our project page: https://sj-yoo.info/viopose/

Related papers

X-Dancer: Expressive Music to Human Dance Video Generation [26.544761204917336]
X-Dancer is a novel zero-shot music-driven image animation pipeline. It creates diverse and long-range lifelike human dance videos from a single static image.
arXiv Detail & Related papers (2025-02-24T18:47:54Z)
DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos [21.93514516437402]
We present DreamScene4D, the first approach to generate 3D dynamic scenes of multiple objects from monocular videos via novel view synthesis. Our key insight is a "decompose-recompose" approach that factorizes the video scene into the background and object tracks. We show extensive results on challenging DAVIS, Kubric, and self-captured videos with quantitative comparisons and a user preference study.
arXiv Detail & Related papers (2024-05-03T17:55:34Z)
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance [50.01162760878841]
We present DCM, a new multi-modal 3D dataset that combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community. We propose DanceCamera3D, a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy.
arXiv Detail & Related papers (2024-03-20T15:24:57Z)
DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image [59.18564636990079]
We study the problem of synthesizing a long-term dynamic video from only a single image. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
arXiv Detail & Related papers (2023-08-20T12:53:50Z)
BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis [123.73677487809418]
We introduce a new dataset aiming to challenge common assumptions in dance motion synthesis. We focus on breakdancing which features acrobatic moves and tangled postures. Our efforts produced the BRACE dataset, which contains over 3 hours and 30 minutes of densely annotated poses.
arXiv Detail & Related papers (2022-07-20T18:03:54Z)
3D Moments from Near-Duplicate Photos [67.15199743223332]
3D Moments is a new computational photography effect. We produce a video that smoothly interpolates the scene motion from the first photo to the second. Our system produces photorealistic space-time videos with motion parallax and scene dynamics.
arXiv Detail & Related papers (2022-05-12T17:56:18Z)
AIMusicGuru: Music Assisted Human Pose Correction [8.020211030279686]
We present a method that leverages the strong causal relationship between the sound produced and the motion that produces it. We use the audio signature to refine and predict accurate human body pose motion models. We also open-source MAPdat, a new multi-modal dataset of 3D violin playing motion with music.
arXiv Detail & Related papers (2022-03-24T03:16:42Z)
NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground. This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion. In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z)
Unsupervised 3D Pose Estimation for Hierarchical Dance Video Recognition [13.289339907084424]
We propose a Hierarchical Dance Video Recognition framework (HDVR). HDVR estimates 2D pose sequences, tracks dancers, and then simultaneously estimates corresponding 3D poses and 3D-to-2D imaging parameters. From the estimated 3D pose sequence, HDVR extracts body part movements, and from them the dance genre.
arXiv Detail & Related papers (2021-09-19T16:59:37Z)
Deep 3D Mask Volume for View Synthesis of Dynamic Scenes [49.45028543279115]
We introduce a multi-view video dataset, captured with a custom 10-camera rig at 120 FPS. The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor scenes. We develop a new algorithm, Deep 3D Mask Volume, which enables temporally-stable view extrapolation from binocular videos of dynamic scenes, captured by static cameras.
arXiv Detail & Related papers (2021-08-30T17:55:28Z)
Learning Motion Priors for 4D Human Body Capture in 3D Scenes [81.54377747405812]
We propose LEMO: LEarning human MOtion priors for 4D human body capture. We introduce a novel motion prior, which reduces the jitters exhibited by poses recovered over a sequence. We also design a contact friction term and a contact-aware motion infiller obtained via per-instance self-supervised training. With our pipeline, we demonstrate high-quality 4D human body capture, reconstructing smooth motions and physically plausible body-scene interactions.
arXiv Detail & Related papers (2021-08-23T20:47:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.