Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
- URL: http://arxiv.org/abs/2602.07064v1
- Date: Thu, 05 Feb 2026 14:04:51 GMT
- Title: Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine
- Authors: Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang,
- Abstract summary: We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text. To inject explicit physical knowledge, we build a physical data engine with two components. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
- Score: 50.62040226184694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction--image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law--constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio--visual consistency filtering to generate high-fidelity video--instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
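The abstract mentions an "intent router" that activates image generation only when a request calls for it. As a rough illustration of that routing idea (not the paper's implementation — the paper presumably uses a learned head over model hidden states, while the cue phrases, function names, and labels below are illustrative assumptions):

```python
# Hypothetical sketch of an intent router: decide whether a prompt should
# trigger the (expensive) image-generation path or the plain
# understanding/QA path. A real router would be a small learned classifier;
# this toy version just checks for generation cue phrases.

GENERATION_CUES = ("draw", "generate an image", "paint", "sketch", "picture of")

def route_intent(prompt: str) -> str:
    """Return 'generate' if the prompt looks like an image request,
    otherwise 'understand' (ordinary multimodal question answering)."""
    p = prompt.lower()
    wants_image = any(cue in p for cue in GENERATION_CUES)
    return "generate" if wants_image else "understand"
```

The design point the abstract makes is efficiency: generation capacity sits behind a gate, so understanding-only queries never pay its cost.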
Related papers
- QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery [12.888415301529891]
Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2026-02-19T15:44:41Z) - PAVAS: Physics-Aware Video-to-Audio Synthesis [58.746986798623084]
We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation. We show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-12-09T06:28:50Z) - VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning [17.790063818997975]
VibraVerse is a large-scale dataset that bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. CLASP is a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning.
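Contrastive cross-modal alignment of the kind CLASP describes is typically built on a symmetric InfoNCE-style objective over paired embeddings. A minimal sketch of that general technique, assuming batched geometry and audio embeddings (this is not VibraVerse/CLASP code; shapes and the temperature value are illustrative assumptions):

```python
# Symmetric InfoNCE-style contrastive loss over paired embeddings.
# Matching (geometry, audio) pairs sit on the diagonal of the similarity
# matrix; the loss pulls those together and pushes mismatched pairs apart.
import numpy as np

def contrastive_loss(geom_emb: np.ndarray, audio_emb: np.ndarray,
                     tau: float = 0.07) -> float:
    """geom_emb, audio_emb: (batch, dim) arrays of paired embeddings."""
    g = geom_emb / np.linalg.norm(geom_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = g @ a.T / tau            # pairwise cosine similarities / temperature
    idx = np.arange(len(g))           # matching pairs lie on the diagonal

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()                 # diagonal = positives

    # Average both directions: geometry->audio and audio->geometry.
    return float((xent(logits) + xent(logits.T)) / 2)
```

The symmetric form (averaging both retrieval directions) is the standard choice when neither modality is privileged.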
arXiv Detail & Related papers (2025-11-25T15:48:49Z) - PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z) - Inferring Dynamic Physical Properties from Video Foundation Models [94.35979242947873]
We study the task of predicting dynamic physical properties from videos. We consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface.
arXiv Detail & Related papers (2025-10-02T17:59:50Z) - PhysID: Physics-based Interactive Dynamics from a Single-view Image [1.7214450148288793]
We present PhysID, which streamlines the creation of physics-based interactive dynamics from a single-view image. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions.
arXiv Detail & Related papers (2025-06-21T15:57:58Z) - PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis [62.283499219361595]
PhysGaia is a physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS). Our dataset provides complex dynamic scenarios with rich interactions among multiple objects. PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation.
arXiv Detail & Related papers (2025-06-03T12:19:18Z) - Think Before You Diffuse: Infusing Physical Rules into Video Diffusion [55.046699347579455]
The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that combines particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.