Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
- URL: http://arxiv.org/abs/2602.07064v1
- Date: Thu, 05 Feb 2026 14:04:51 GMT
- Title: Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
- Authors: Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang,
- Abstract summary: We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text. To inject explicit physical knowledge, we build a physical data engine with two components. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
- Score: 50.62040226184694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction--image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law--constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio--visual consistency filtering to generate high-fidelity video--instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
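The abstract mentions an "intent router" that activates image generation only when a request calls for it. As a rough illustration of that routing idea (not the paper's implementation — the paper presumably uses a learned head over model hidden states, while the cue phrases, function names, and labels below are illustrative assumptions):

```python
# Hypothetical sketch of an intent router: decide whether a prompt should
# trigger the (expensive) image-generation path or the plain
# understanding/QA path. A real router would be a small learned classifier;
# this toy version just checks for generation cue phrases.

GENERATION_CUES = ("draw", "generate an image", "paint", "sketch", "picture of")

def route_intent(prompt: str) -> str:
    """Return 'generate' if the prompt looks like an image request,
    otherwise 'understand' (ordinary multimodal question answering)."""
    p = prompt.lower()
    wants_image = any(cue in p for cue in GENERATION_CUES)
    return "generate" if wants_image else "understand"
```

The design point the abstract makes is efficiency: generation capacity sits behind a gate, so understanding-only queries never pay its cost.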
Related papers
- QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery [12.888415301529891]
Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2026-02-19T15:44:41Z) - PAVAS: Physics-Aware Video-to-Audio Synthesis [58.746986798623084]
We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation. We show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-12-09T06:28:50Z) - VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning [17.790063818997975]
VibraVerse is a large-scale dataset that bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. CLASP is a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning.
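Contrastive cross-modal alignment of the kind CLASP describes is typically built on a symmetric InfoNCE-style objective over paired embeddings. A minimal sketch of that general technique, assuming batched geometry and audio embeddings (this is not VibraVerse/CLASP code; shapes and the temperature value are illustrative assumptions):

```python
# Symmetric InfoNCE-style contrastive loss over paired embeddings.
# Matching (geometry, audio) pairs sit on the diagonal of the similarity
# matrix; the loss pulls those together and pushes mismatched pairs apart.
import numpy as np

def contrastive_loss(geom_emb: np.ndarray, audio_emb: np.ndarray,
                     tau: float = 0.07) -> float:
    """geom_emb, audio_emb: (batch, dim) arrays of paired embeddings."""
    g = geom_emb / np.linalg.norm(geom_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = g @ a.T / tau            # pairwise cosine similarities / temperature
    idx = np.arange(len(g))           # matching pairs lie on the diagonal

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()                 # diagonal = positives

    # Average both directions: geometry->audio and audio->geometry.
    return float((xent(logits) + xent(logits.T)) / 2)
```

The symmetric form (averaging both retrieval directions) is the standard choice when neither modality is privileged.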
arXiv Detail & Related papers (2025-11-25T15:48:49Z) - PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z) - Inferring Dynamic Physical Properties from Video Foundation Models [94.35979242947873]
We study the task of predicting dynamic physical properties from videos. We consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface.
arXiv Detail & Related papers (2025-10-02T17:59:50Z) - PhysID: Physics-based Interactive Dynamics from a Single-view Image [1.7214450148288793]
We present PhysID, which streamlines the creation of physics-based interactive dynamics from a single-view image. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions.
arXiv Detail & Related papers (2025-06-21T15:57:58Z) - PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis [62.283499219361595]
PhysGaia is a physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS). Our dataset provides complex dynamic scenarios with rich interactions among multiple objects. PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation.
arXiv Detail & Related papers (2025-06-03T12:19:18Z) - Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that combines particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.