Inferring Dynamic Physical Properties from Video Foundation Models
- URL: http://arxiv.org/abs/2510.02311v1
- Date: Thu, 02 Oct 2025 17:59:50 GMT
- Title: Inferring Dynamic Physical Properties from Video Foundation Models
- Authors: Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman
- Abstract summary: We study the task of predicting dynamic physical properties from videos. We consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface.
- Score: 94.35979242947873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
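Contribution (ii)(b), the read-out mechanism, can be sketched roughly as follows: a single trainable prompt vector cross-attends over frozen video-foundation-model features, and a linear head regresses the scalar property. This is a minimal sketch under assumed details; the class name `PropertyReadout`, the feature dimension, and the attention configuration are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PropertyReadout(nn.Module):
    """Hypothetical read-out head: a trainable prompt (query) vector
    cross-attends over frozen video features, then a linear layer
    regresses one scalar property (e.g. elasticity or viscosity)."""

    def __init__(self, feat_dim: int = 768, n_heads: int = 8):
        super().__init__()
        # Trainable prompt vector, shared across all videos.
        self.prompt = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) tokens from a frozen video foundation model,
        # e.g. N = frames x patches.
        q = self.prompt.expand(feats.size(0), -1, -1)      # (B, 1, D)
        pooled, _ = self.cross_attn(q, feats, feats)       # cross-attention pooling
        return self.head(pooled.squeeze(1))                # (B, 1) predicted property

# Usage with dummy features: 2 videos, 16 frames x 49 patch tokens, dim 768.
feats = torch.randn(2, 16 * 49, 768)
pred = PropertyReadout()(feats)
```

In this setup only the prompt vector, attention, and linear head would be trained, with the backbone kept frozen, which is what makes the read-out a lightweight probe of the pre-trained representation.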
Related papers
- PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement [45.990473754456104]
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. We propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. We show that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks.
arXiv Detail & Related papers (2025-12-04T07:28:56Z)
- Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph [29.737059125885057]
Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on ML-Bench. Code, model, and data will be released.
arXiv Detail & Related papers (2025-10-13T03:26:56Z)
- Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation [55.046699347579455]
We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation. Our method leverages large language models (LLMs) to explicitly reason about a comprehensive physical context from the text prompt. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- PhyMAGIC: Physical Motion-Aware Generative Inference with Confidence-guided LLM [17.554471769834453]
We present PhyMAGIC, a training-free framework that generates physically consistent motion from a single image. PhyMAGIC integrates a pre-trained image-to-video diffusion model, confidence-guided reasoning via LLMs, and a differentiable physics simulator. Comprehensive experiments demonstrate that PhyMAGIC outperforms state-of-the-art video generators and physics-aware baselines.
arXiv Detail & Related papers (2025-05-22T09:40:34Z)
- Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models [9.474337395173388]
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). Fine-tuning is expensive for large models and impractical to repeat for every task. We introduce Physics Context Builders (PCBs), a novel modular framework in which specialized VLMs are fine-tuned to generate detailed physical scene descriptions.
arXiv Detail & Related papers (2024-12-11T18:40:16Z)
- Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting [32.846428862045634]
We present Sim Anything, a physics-based approach that endows static 3D objects with interactive dynamics. Inspired by human visual reasoning, we propose MLLM-based Physical Property Perception. We also simulate objects in an open-world scene with particles sampled via Physical-Geometric Adaptive Sampling.
arXiv Detail & Related papers (2024-11-19T12:52:21Z)
- The Sound of Water: Inferring Physical Properties from Pouring Liquids [85.30865788636386]
We study the connection between audio-visual observations and the underlying physics of pouring liquids. Our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate, and the time to fill.
arXiv Detail & Related papers (2024-11-18T01:19:37Z)
- OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics [22.119612406160073]
We present OCK, a dynamic video prediction model leveraging object-centric kinematics and object slots. We introduce a novel component named Object Kinematics that comprises explicit object motions. Our model demonstrates superior performance in complex scenes with intricate object attributes and motions.
arXiv Detail & Related papers (2024-04-29T04:47:23Z)
- EDO-Net: Learning Elastic Properties of Deformable Objects from Graph Dynamics [24.33743287768859]
We study the problem of learning graph dynamics of deformable objects that generalizes to unknown physical properties. We propose EDO-Net, a model of graph dynamics trained on a variety of samples with different elastic properties.
arXiv Detail & Related papers (2022-09-19T13:20:19Z)
- Neural Implicit Representations for Physical Parameter Inference from a Single Video [49.766574469284485]
We propose to combine neural implicit representations for appearance modeling with neural ordinary differential equations (ODEs) for modeling physical phenomena.
Our proposed model combines several unique advantages: (i) Contrary to existing approaches that require large training datasets, we are able to identify physical parameters from only a single video.
The use of neural implicit representations enables the processing of high-resolution videos and the synthesis of photo-realistic images.
arXiv Detail & Related papers (2022-04-29T11:55:35Z)
- Which priors matter? Benchmarking models for learning latent dynamics [70.88999063639146]
Several methods have been proposed to integrate priors from classical mechanics into machine learning models.
We take a sober look at the current capabilities of these models.
We find that the use of continuous and time-reversible dynamics benefits models of all classes.
arXiv Detail & Related papers (2021-11-09T23:48:21Z)
- Learning Local Recurrent Models for Human Mesh Recovery [50.85467243778406]
We present a new method for video mesh recovery that divides the human mesh into several local parts following the standard skeletal model.
We then model the dynamics of each local part with separate recurrent models, with each model conditioned appropriately based on the known kinematic structure of the human body.
This results in a structure-informed local recurrent learning architecture that can be trained in an end-to-end fashion with available annotations.
arXiv Detail & Related papers (2021-07-27T14:30:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.