Related papers: Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

URL: http://arxiv.org/abs/2303.16897v3
Date: Sat, 8 Jul 2023 07:12:28 GMT
Title: Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
Authors: Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan
Abstract summary: Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound. Existing video-driven deep learning-based approaches could only capture the weak correspondence between visual content and impact sounds. We propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip.
Score: 78.49864987061689
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely available in the real world and can not be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning-based approaches could only capture the weak correspondence between visual content and impact sounds since they lack of physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip. In addition to the video content, we propose to use additional physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, thus enabling us to perform sound editing flexibly.

Related papers

Sonify Anything: Towards Context-Aware Sonic Interactions in AR [38.82194569186157]
We propose a framework for context-aware sounds using methods from computer vision to recognize and segment the materials of real objects.<n>Results show that material-based sounds led to significantly more realistic sonic interactions.<n>These findings show that context-aware, material-based sonic interactions in AR foster a stronger sense of realism and enhance our perception of real-world surroundings.
arXiv Detail & Related papers (2025-08-03T14:56:56Z)
MOSPA: Human Motion Generation Driven by Spatial Audio [56.735282455483954]
We introduce the first comprehensive Spatial Audio-Driven Human Motion dataset, which contains diverse and high-quality spatial audio and motion data.<n>We develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA.<n>Once trained, MOSPA could generate diverse realistic human motions conditioned on varying spatial audio inputs.
arXiv Detail & Related papers (2025-07-16T06:33:11Z)
Teaching Physical Awareness to LLMs through Sounds [2.5260091444764554]
ACORN is a framework that teaches Large Language Models (LLMs) physical awareness through sound.<n>We build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information.<n>We demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation.
arXiv Detail & Related papers (2025-06-10T07:42:56Z)
PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis [62.283499219361595]
PhysGaia is a physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS)<n>Our dataset provides complex dynamic scenarios with rich interactions among multiple objects.<n>PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation.
arXiv Detail & Related papers (2025-06-03T12:19:18Z)
Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation [28.79821758835663]
We propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation.<n>Our method leverages large language models (LLMs) to explicitly reason a comprehensive physical context from the text prompt.<n>We also establish a high-quality physical video dataset containing diverse phyiscal actions and events to facilitate effective finetuning.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
Synthetic Video Enhances Physical Fidelity in Video Synthesis [25.41774228022216]
We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. We propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model. Our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis.
arXiv Detail & Related papers (2025-03-26T00:45:07Z)
The Sound of Water: Inferring Physical Properties from Pouring Liquids [85.30865788636386]
We study the connection between audio-visual observations and the underlying physics of pouring liquids. Our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill.
arXiv Detail & Related papers (2024-11-18T01:19:37Z)
Differentiable Physics-based System Identification for Robotic Manipulation of Elastoplastic Materials [43.99845081513279]
This work introduces a novel Differentiable Physics-based System Identification (DPSI) framework that enables a robot arm to infer the physics parameters of elastoplastic materials and the environment. With only a single real-world interaction, the estimated parameters can accurately simulate visually and physically realistic behaviours.
arXiv Detail & Related papers (2024-11-01T13:04:25Z)
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation [29.831214435147583]
We present PhysGen, a novel image-to-video generation method. It produces a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process.
arXiv Detail & Related papers (2024-09-27T17:59:57Z)
Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video [58.043569985784806]
We introduce latent intuitive physics, a transfer learning framework for physics simulation. It can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes. We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation.
arXiv Detail & Related papers (2024-06-18T16:37:44Z)
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation [62.53760963292465]
PhysDreamer is a physics-based approach that endows static 3D objects with interactive dynamics. We present our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study.
arXiv Detail & Related papers (2024-04-19T17:41:05Z)
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language. This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)
Physics-based Human Motion Estimation and Synthesis from Videos [0.0]
We propose a framework for training generative models of physically plausible human motion directly from monocular RGB videos. At the core of our method is a novel optimization formulation that corrects imperfect image-based pose estimations. Results show that our physically-corrected motions significantly outperform prior work on pose estimation.
arXiv Detail & Related papers (2021-09-21T01:57:54Z)
Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions. We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors. Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.