Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
- URL: http://arxiv.org/abs/2503.03562v3
- Date: Wed, 26 Mar 2025 03:58:26 GMT
- Title: Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
- Authors: Wenqiao Li, Yao Gu, Xintao Chen, Xiaohao Xu, Ming Hu, Xiaonan Huang, Yingna Wu
- Abstract summary: Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. Phys-AD is the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies.
- Score: 2.1013864820763755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our project is available at https://guyao2023.github.io/Phys-AD/.
Related papers
- HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies [30.95227838131802]
Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens.
arXiv Detail & Related papers (2026-02-23T07:40:32Z) - Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos [82.4003989236851]
We propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. Experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real world physical reasoning performance of Qwen2.5VL by 2.5%, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark.
arXiv Detail & Related papers (2026-01-23T06:02:07Z) - PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models [40.16417939211015]
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning. Existing benchmarks attempting to measure this matter rely on synthetic, Visual Question Answer templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. We introduce PhysicsMind, a unified benchmark that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law.
arXiv Detail & Related papers (2026-01-22T14:33:01Z) - EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding [56.89359230139883]
We introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning and Intent-Driven Reasoning. We present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). It is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes.
arXiv Detail & Related papers (2026-01-04T14:42:39Z) - PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis [52.905353023326306]
We propose PhysWorld, a framework that synthesizes physically plausible and diverse demonstrations to learn efficient world models. Experiments show that PhysWorld has competitive performance while enabling inference speeds 47 times faster than the recent state-of-the-art method, i.e., PhysTwin.
arXiv Detail & Related papers (2025-10-24T13:25:39Z) - LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference [57.086932851733145]
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models. We benchmark intuitive physics understanding in current video diffusion models. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
arXiv Detail & Related papers (2025-10-13T15:19:07Z) - MASIV: Toward Material-Agnostic System Identification from Videos [76.36666848173141]
MASIV is a vision-based framework for material-agnostic system identification. It employs learnable neural models, inferring object dynamics without assuming a scene-specific material prior. It achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
arXiv Detail & Related papers (2025-08-01T23:23:45Z) - Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go? [32.20899533556529]
Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles or medical robotics. This paper presents the first large-scale empirical study characterizing physics failures in PE-based software.
arXiv Detail & Related papers (2025-07-29T17:58:41Z) - Physics-informed Ground Reaction Dynamics from Human Motion Capture [4.4795626402834055]
We propose a novel method for estimating human ground reaction dynamics directly from motion capture data. We introduce a highly accurate and robust method for computing ground reaction forces from motion capture data using Euler's integration scheme and PD algorithm. The proposed approach was tested on the GroundLink dataset.
arXiv Detail & Related papers (2025-07-02T04:02:16Z) - Measuring Physical Plausibility of 3D Human Poses Using Physics Simulation [19.26289173517333]
We introduce two metrics to capture the physical plausibility and stability of predicted 3D poses from any 3D Human Pose Estimation model. Using physics simulation, we discover correlations with existing plausibility metrics and measure stability during motion.
arXiv Detail & Related papers (2025-02-06T20:15:49Z) - PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos [66.09921831504238]
We propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos.
Our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts.
Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM.
arXiv Detail & Related papers (2024-12-02T18:47:25Z) - The Sound of Water: Inferring Physical Properties from Pouring Liquids [85.30865788636386]
We study the connection between audio-visual observations and the underlying physics of pouring liquids. Our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill.
arXiv Detail & Related papers (2024-11-18T01:19:37Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries the particle-based physical dynamic models with the recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z) - PAD: A Dataset and Benchmark for Pose-agnostic Anomaly Detection [28.973078719467516]
We develop the Multi-pose Anomaly Detection (MAD) dataset and the Pose-agnostic Anomaly Detection (PAD) benchmark.
Specifically, we build MAD using 20 complex-shaped LEGO toys with various poses, and high-quality and diverse 3D anomalies in both simulated and real environments.
We also propose a novel method OmniposeAD, trained using MAD, specifically designed for pose-agnostic anomaly detection.
arXiv Detail & Related papers (2023-10-11T17:59:56Z) - Physically Grounded Vision-Language Models for Robotic Manipulation [59.143640049407104]
We propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations.
We show that fine-tuning a vision-language model on PhysObjects improves its understanding of physical object concepts.
We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner.
arXiv Detail & Related papers (2023-09-05T20:21:03Z) - Triggering Dark Showers with Conditional Dual Auto-Encoders [1.5615730862955413]
We present a family of conditional dual auto-encoders (CoDAEs) for generic and model-independent new physics searches at colliders.
arXiv Detail & Related papers (2023-06-22T15:13:18Z) - Trajectory Optimization for Physics-Based Reconstruction of 3d Human Pose from Monocular Video [31.96672354594643]
We focus on the task of estimating a physically plausible articulated human motion from monocular video.
Existing approaches that do not consider physics often produce temporally inconsistent output with motion artifacts.
We show that our approach achieves competitive results with respect to existing physics-based methods on the Human3.6M benchmark.
arXiv Detail & Related papers (2022-05-24T18:02:49Z) - SPACE: A Simulator for Physical Interactions and Causal Learning in 3D Environments [2.105564340986074]
We introduce SPACE: A Simulator for Physical Interactions and Causal Learning in 3D Environments.
Inspired by daily object interactions, the SPACE dataset comprises videos depicting three types of physical events: containment, stability and contact.
We show that the SPACE dataset improves the learning of intuitive physics with an approach inspired by curriculum learning.
arXiv Detail & Related papers (2021-08-13T11:49:46Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work, including state-of-the-art methods specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z) - Living in the Physics and Machine Learning Interplay for Earth Observation [7.669855697331746]
Inference means understanding the relations between variables and deriving models that are physically interpretable.
Machine learning models alone are excellent approximators, but very often do not respect the most elementary laws of physics.
This is a collective long-term AI agenda towards developing and applying algorithms capable of discovering knowledge in the Earth system.
arXiv Detail & Related papers (2020-10-18T16:58:20Z) - Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
A key ability for artificial systems is to understand physical interactions between objects and predict future outcomes of a situation.
This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)