PAVAS: Physics-Aware Video-to-Audio Synthesis
- URL: http://arxiv.org/abs/2512.08282v1
- Date: Tue, 09 Dec 2025 06:28:50 GMT
- Title: PAVAS: Physics-Aware Video-to-Audio Synthesis
- Authors: Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
- Abstract summary: We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation. We show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.
- Score: 58.746986798623084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving object's mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce the Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
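For intuition, the sketch below (Python with NumPy) illustrates the two physical cues the abstract describes, mass and impact velocity, and an APCC-style physics-audio correlation. The abstract does not give the exact APCC formula, so the use of kinetic energy as the physical attribute, short-time RMS loudness as the auditory attribute, and Pearson correlation between them is an illustrative assumption, not the authors' definition; `impact_speed`, `peak_rms_loudness`, and `apcc_like_score` are hypothetical helper names.

```python
import numpy as np

def impact_speed(trajectory: np.ndarray, fps: float) -> float:
    """Speed just before impact from a (T, 3) array of 3D positions,
    via finite differences; assumes the impact occurs at the last frame."""
    velocities = np.diff(trajectory, axis=0) * fps  # (T-1, 3), units/s
    return float(np.linalg.norm(velocities[-1]))

def peak_rms_loudness(audio: np.ndarray, sr: int, win: float = 0.05) -> float:
    """Maximum short-time RMS of a mono waveform; a simple stand-in for an
    'auditory attribute' such as impact loudness."""
    n = max(1, int(sr * win))
    frames = np.lib.stride_tricks.sliding_window_view(audio, n)[::n]
    return float(np.sqrt((frames ** 2).mean(axis=1)).max())

def apcc_like_score(masses, speeds, loudnesses) -> float:
    """Pearson correlation between a physical attribute (kinetic energy,
    0.5*m*v^2) and an auditory attribute over a set of clips; an
    illustrative proxy, not the paper's exact APCC definition."""
    energy = 0.5 * np.asarray(masses, float) * np.asarray(speeds, float) ** 2
    return float(np.corrcoef(energy, np.asarray(loudnesses, float))[0, 1])
```

In a pipeline like the one described, the mass would come from the VLM query and the trajectory from the dynamic 3D reconstruction module; here both are taken as given inputs.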
Related papers
- PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation [63.3417467957431]
Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content.
We present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to evaluate the audio physics grounding capabilities of existing T2AV models.
Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation.
arXiv Detail & Related papers (2025-12-30T05:22:31Z)
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation.
Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions.
Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
- Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models [14.187604603759784]
We present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of text-to-video systems.
For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline.
PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
arXiv Detail & Related papers (2025-07-21T17:30:46Z)
- Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data.
We propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)
- Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos [78.49864987061689]
Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound.
Existing video-driven deep learning approaches capture only a weak correspondence between visual content and impact sounds.
We propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip.
arXiv Detail & Related papers (2023-03-29T17:59:53Z)
- Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)