BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds
- URL: http://arxiv.org/abs/2503.00389v1
- Date: Sat, 01 Mar 2025 07:32:19 GMT
- Title: BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds
- Authors: Yuto Shibata, Yusuke Oumi, Go Irie, Akisato Kimura, Yoshimitsu Aoki, Mariko Isogawa
- Abstract summary: BGM2Pose is a non-invasive 3D human pose estimation method using arbitrary music (e.g., background music) as active sensing signals. Our method utilizes natural music that causes minimal discomfort to humans.
- Score: 16.0759003139539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose BGM2Pose, a non-invasive 3D human pose estimation method using arbitrary music (e.g., background music) as active sensing signals. Unlike existing approaches that significantly limit practicality by employing intrusive chirp signals within the audible range, our method utilizes natural music that causes minimal discomfort to humans. Estimating human poses from standard music presents significant challenges. In contrast to sound sources specifically designed for measurement, regular music varies in both volume and pitch. These dynamic signal changes caused by the music are inevitably mixed with alterations in the sound field resulting from human motion, making it hard to extract reliable cues for pose estimation. To address these challenges, BGM2Pose introduces a Contrastive Pose Extraction Module that employs contrastive learning and hard negative sampling to eliminate musical components from the recorded data, isolating the pose information. Additionally, we propose a Frequency-wise Attention Module that enables the model to focus on subtle acoustic variations attributable to human movement by dynamically computing attention across frequency bands. Experiments suggest that our method outperforms existing methods, demonstrating substantial potential for real-world applications. Our datasets and code will be made publicly available.
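The two modules described in the abstract are concrete enough to sketch. Below is a minimal, illustrative PyTorch version: the class and function names, tensor shapes, embedding sizes, and the InfoNCE-style formulation are our assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only -- shapes, names, and hyperparameters are assumed,
# not taken from the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyWiseAttention(nn.Module):
    """One plausible reading of the Frequency-wise Attention Module: treat
    each frequency band of a (batch, freq, time) spectrogram as a token and
    attend across bands, letting the model weight the bands where
    motion-induced acoustic changes show up."""
    def __init__(self, n_time: int, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(n_time, d_model)  # band's time profile -> token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, spec: torch.Tensor):
        tokens = self.embed(spec)                  # (batch, n_freq, d_model)
        out, band_weights = self.attn(tokens, tokens, tokens)
        return out, band_weights                   # weights over frequency bands

def contrastive_pose_loss(anchor, positive, hard_negatives, tau: float = 0.1):
    """InfoNCE-style loss with explicit hard negatives: pull together
    embeddings that share a pose (e.g., same pose, different music) and push
    apart hard negatives that share the music but not the pose -- one
    plausible reading of the Contrastive Pose Extraction Module."""
    a = F.normalize(anchor, dim=-1)                # (batch, d)
    p = F.normalize(positive, dim=-1)              # (batch, d)
    n = F.normalize(hard_negatives, dim=-1)        # (batch, k, d)
    pos = (a * p).sum(dim=-1, keepdim=True) / tau  # (batch, 1)
    neg = torch.einsum("bd,bkd->bk", a, n) / tau   # (batch, k)
    logits = torch.cat([pos, neg], dim=1)          # positive is class 0
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    spec = torch.randn(2, 64, 256)      # (batch, 64 freq bands, 256 frames)
    feats, w = FrequencyWiseAttention(n_time=256)(spec)
    print(feats.shape, w.shape)         # (2, 64, 128) and (2, 64, 64)
    loss = contrastive_pose_loss(torch.randn(2, 32), torch.randn(2, 32),
                                 torch.randn(2, 5, 32))
    print(float(loss))
```

One design point the abstract implies: hard negatives should share the nuisance factor (the music) while differing in pose, so the loss explicitly cancels musical content rather than relying on random negatives.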
Related papers
- Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping [8.560397278656646]
People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. We propose a neural network based on a psychoacoustic masking model to enhance the music's ability to mask ambient noise. We evaluate our approach on simulated data replicating a user's experience of listening to music with headphones in a noisy environment.
arXiv Detail & Related papers (2025-02-24T07:58:10Z)
- Acoustic-based 3D Human Pose Estimation Robust to Human Position [16.0759003139539]
The existing active acoustic sensing-based approach for 3D human pose estimation implicitly assumes that the target user is positioned along a line between loudspeakers and a microphone.
Because reflection and diffraction of sound by the human body cause only subtle acoustic signal changes compared to outright sound obstruction, the accuracy of the existing model degrades significantly when subjects deviate from this line.
To overcome this limitation, we propose a novel method composed of a position discriminator and a reverberation-resistant model.
arXiv Detail & Related papers (2024-11-08T15:56:12Z)
- Enhancing Sequential Music Recommendation with Personalized Popularity Awareness [56.972624411205224]
This paper introduces a novel approach that incorporates personalized popularity information into sequential recommendation.
Experimental results demonstrate that a Personalized Most Popular recommender outperforms existing state-of-the-art models.
arXiv Detail & Related papers (2024-09-06T15:05:12Z)
- MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss [51.85076222868963]
We introduce a pre-training task designed to link control signals directly with corresponding musical tokens.
We then implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts.
arXiv Detail & Related papers (2024-07-05T08:08:22Z)
- Quantifying Noise of Dynamic Vision Sensor [49.665407116447454]
Dynamic vision sensors (DVS) are characterised by a large amount of background activity (BA) noise.
It is difficult to distinguish between noise and the cleaned sensor signals using standard image processing techniques.
A new technique, derived from Detrended Fluctuation Analysis (DFA), is presented to characterise BA noise; a minimal DFA sketch appears after this list.
arXiv Detail & Related papers (2024-04-02T13:43:08Z)
- Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training [18.71152526968065]
Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content.
This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings; a generic domain-adversarial (gradient reversal) sketch appears after this list.
arXiv Detail & Related papers (2024-01-27T06:56:51Z)
- DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation [89.50310360658791]
We present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation.
This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model.
We demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music.
arXiv Detail & Related papers (2023-08-05T16:18:57Z)
- Generating music with sentiment using Transformer-GANs [0.0]
We propose a generative model of symbolic music conditioned by data retrieved from human sentiment.
We tackle both of these problems by employing an efficient linear version of attention and using a discriminator.
arXiv Detail & Related papers (2022-12-21T15:59:35Z)
- AIMusicGuru: Music Assisted Human Pose Correction [8.020211030279686]
We present a method that leverages the strong causal relationship between the sound produced and the motion that produces it.
We use the audio signature to refine and predict accurate human body pose motion models.
We also open-source MAPdat, a new multi-modal dataset of 3D violin playing motion with music.
arXiv Detail & Related papers (2022-03-24T03:16:42Z)
- Active Audio-Visual Separation of Dynamic Sound Sources [93.97385339354318]
We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone.
We show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
arXiv Detail & Related papers (2022-02-02T02:03:28Z)
- A Flow Base Bi-path Network for Cross-scene Video Crowd Understanding in Aerial View [93.23947591795897]
In this paper, we strive to tackle the challenges and automatically understand the crowd from the visual data collected from drones.
To alleviate the background noise generated in cross-scene testing, a double-stream crowd counting model is proposed.
To tackle the crowd density estimation problem in extremely dark environments, we introduce synthetic data generated with the game Grand Theft Auto V (GTAV).
arXiv Detail & Related papers (2020-09-29T01:48:24Z)
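For the "Quantifying Noise of Dynamic Vision Sensor" entry above, detrended fluctuation analysis is a standard, self-contained algorithm, so a minimal NumPy sketch follows. The window sizes and the synthetic event-count input are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal detrended fluctuation analysis (DFA), the algorithm named in the
# "Quantifying Noise of Dynamic Vision Sensor" entry. Window sizes and the
# synthetic event-count input below are illustrative assumptions.
import numpy as np

def dfa_exponent(x: np.ndarray, scales=(4, 8, 16, 32, 64, 128)) -> float:
    """Return the DFA scaling exponent alpha of a 1-D signal."""
    y = np.cumsum(x - np.mean(x))            # integrated, mean-centred profile
    flucts = []
    for s in scales:
        n_win = len(y) // s
        segments = y[: n_win * s].reshape(n_win, s)
        t = np.arange(s)
        rms = []
        for seg in segments:                 # linear detrend per window
            coeffs = np.polyfit(t, seg, 1)
            resid = seg - np.polyval(coeffs, t)
            rms.append(np.sqrt(np.mean(resid ** 2)))
        flucts.append(np.mean(rms))
    # alpha is the slope of log F(s) versus log s
    alpha, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return float(alpha)

# Synthetic stand-in for per-pixel DVS event counts binned over time:
counts = np.random.poisson(3.0, size=4096).astype(float)
print(f"DFA exponent alpha = {dfa_exponent(counts):.2f}")  # ~0.5 for uncorrelated noise
```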
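For the music auto-tagging entry, domain adversarial training is commonly implemented with a gradient reversal layer (as in DANN). The sketch below shows that generic pattern under assumed dimensions; it is not the paper's code, and the layer sizes, tag count, and two-domain setup (clean vs. noisy audio) are assumptions.

```python
# Generic domain adversarial training via a gradient reversal layer (DANN
# style). Layer sizes, the number of tags, and the two-domain setup are
# assumptions for illustration.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient w.r.t. lam

class DomainAdversarialTagger(nn.Module):
    """Shared encoder feeds a tag classifier and, through gradient reversal,
    a domain discriminator, so learned features become noise-invariant."""
    def __init__(self, d_in=128, d_feat=256, n_tags=50, n_domains=2, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(d_in, d_feat), nn.ReLU())
        self.tagger = nn.Linear(d_feat, n_tags)
        self.domain_clf = nn.Linear(d_feat, n_domains)

    def forward(self, x):
        feat = self.encoder(x)
        tag_logits = self.tagger(feat)
        dom_logits = self.domain_clf(GradReverse.apply(feat, self.lam))
        return tag_logits, dom_logits

model = DomainAdversarialTagger()
tags, doms = model(torch.randn(8, 128))
print(tags.shape, doms.shape)  # torch.Size([8, 50]) torch.Size([8, 2])
```

The reversed gradient makes the shared encoder maximize the domain classifier's loss, erasing noise-domain cues while the tagging head trains normally.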