VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning
- URL: http://arxiv.org/abs/2511.20422v1
- Date: Tue, 25 Nov 2025 15:48:49 GMT
- Title: VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning
- Authors: Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li
- Abstract summary: VibraVerse is a large-scale dataset that bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. CLASP is a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning.
- Score: 17.790063818997975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
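The two technical components in the abstract, modal impact-sound synthesis from geometry and material parameters, and CLASP's contrastive cross-modal alignment, can be illustrated with short, hedged sketches. The first is a minimal modal-synthesis example: given modal eigenfrequencies, damping rates, and excitation-dependent gains (quantities VibraVerse derives from volumetric geometry, density, Young's modulus, and Poisson's ratio), an impact sound can be approximated as a sum of exponentially decaying sinusoids. Function names and numeric values below are illustrative assumptions, not the dataset's actual tooling.

```python
# Minimal sketch of modal impact-sound synthesis (assumed form, not the paper's code):
# s(t) = sum_i g_i * exp(-d_i * t) * sin(2 * pi * f_i * t)
import numpy as np

def synthesize_impact_sound(freqs_hz, dampings, gains, sr=44100, duration=1.0):
    """Sum of damped sinusoids parameterized by modal frequencies, decays, and gains."""
    t = np.arange(int(sr * duration)) / sr
    signal = np.zeros_like(t)
    for f, d, g in zip(freqs_hz, dampings, gains):
        signal += g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

# Example with three illustrative modes of a small resonant object.
sound = synthesize_impact_sound(
    freqs_hz=[440.0, 1230.0, 2750.0],  # modal eigenfrequencies (Hz)
    dampings=[8.0, 15.0, 30.0],        # decay rates (1/s)
    gains=[1.0, 0.5, 0.25],            # excitation-dependent modal gains
)
```

The second sketch shows one plausible reading of the CLASP alignment objective: a symmetric, CLIP-style InfoNCE loss over paired shape and sound embeddings in a shared representation space. The abstract does not spell out the exact loss, so this should be read as a sketch of the general technique rather than the authors' implementation.

```python
# Sketch of a symmetric contrastive alignment loss over paired (shape, sound) embeddings.
# The batch index pairs positives; all other pairs in the batch serve as negatives.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(shape_emb, sound_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of B paired embeddings of shape (B, D)."""
    shape_emb = F.normalize(shape_emb, dim=-1)
    sound_emb = F.normalize(sound_emb, dim=-1)
    logits = shape_emb @ sound_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2a = F.cross_entropy(logits, targets)               # shape -> sound direction
    loss_a2s = F.cross_entropy(logits.t(), targets)           # sound -> shape direction
    return 0.5 * (loss_s2a + loss_a2s)
```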
Related papers
- Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles [74.32932832937618]
We introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions.
arXiv Detail & Related papers (2026-03-02T21:32:30Z) - Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine [50.62040226184694]
We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text. To inject explicit physical knowledge, we build a physical data engine with two components. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
arXiv Detail & Related papers (2026-02-05T14:04:51Z) - Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects [59.51185639557874]
We introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry.
arXiv Detail & Related papers (2025-11-03T07:21:42Z) - Inferring Dynamic Physical Properties from Video Foundation Models [94.35979242947873]
We study the task of predicting dynamic physical properties from videos. We consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface.
arXiv Detail & Related papers (2025-10-02T17:59:50Z) - Optimizing Speech Language Models for Acoustic Consistency [2.5864269455844484]
We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives.
arXiv Detail & Related papers (2025-09-30T13:59:52Z) - Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture [0.0]
This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data.
arXiv Detail & Related papers (2025-04-25T05:57:22Z) - The Sound of Water: Inferring Physical Properties from Pouring Liquids [85.30865788636386]
We study the connection between audio-visual observations and the underlying physics of pouring liquids. Our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate, and the time to fill.
arXiv Detail & Related papers (2024-11-18T01:19:37Z) - Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation [17.03776191787701]
We introduce a novel model for simulating motion properties of nonlinear strings.
We integrate modal synthesis and spectral modeling within a physical modeling network framework.
Empirical evaluations demonstrate that the architecture achieves superior accuracy in string motion simulation.
arXiv Detail & Related papers (2024-07-07T23:36:51Z) - Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer [11.91784203088159]
We develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms.
Our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
arXiv Detail & Related papers (2023-09-26T00:21:17Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)