HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies
- URL: http://arxiv.org/abs/2602.19571v1
- Date: Mon, 23 Feb 2026 07:40:32 GMT
- Title: HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies
- Authors: Chang Liu, Yunfan Ye, Qingyang Zhou, Xichen Tan, Mengxuan Luo, Zhenyu Qiu, Wei Peng, Zhiping Cai
- Abstract summary: Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens.
- Score: 30.95227838131802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
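The abstract compares accuracy on ontological and causal questions. As a minimal sketch of how such a per-type gap could be computed, the snippet below tallies accuracy by anomaly type. The function name `accuracy_by_type`, the record layout, and the toy records are all hypothetical and are not taken from the paper; only the counts 1,439 and 3,470 come from the abstract. It also assumes the gap is measured in absolute percentage points, which the abstract does not specify.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return {anomaly_type: accuracy} for (anomaly_type, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for anomaly_type, is_correct in records:
        totals[anomaly_type] += 1
        hits[anomaly_type] += int(is_correct)
    return {t: hits[t] / totals[t] for t in totals}


# Toy, made-up records for illustration only (not HOCA-Bench results).
toy_records = [
    ("ontological", True), ("ontological", True),
    ("ontological", True), ("ontological", False),
    ("causal", True), ("causal", False),
    ("causal", False), ("causal", True),
]

acc = accuracy_by_type(toy_records)

# Gap in absolute percentage points (an assumption about how the gap is defined).
gap_points = (acc["ontological"] - acc["causal"]) * 100

# Average QA pairs per video, using the counts stated in the abstract.
qa_per_video = 3470 / 1439

print(acc, gap_points, round(qa_per_video, 2))
```

With real benchmark output, `toy_records` would be replaced by one record per QA pair, tagged with the anomaly type of its video.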
Related papers
- Spatial Causal Prediction in Video [56.22332198384257]
We introduce a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We construct a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding.
arXiv Detail & Related papers (2026-03-04T11:09:39Z)
- A Mechanistic View on Video Generation as World Models: State and Dynamics [43.951972667861575]
This work proposes a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
arXiv Detail & Related papers (2026-01-22T19:00:18Z)
- MMGR: Multi-Modal Generative Reasoning [97.44203203196481]
We introduce MMGR, a principled evaluation framework based on five reasoning abilities. MMGR evaluates generative reasoning across three domains: Abstract Reasoning, Embodied Navigation, and Physical Commonsense. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image).
arXiv Detail & Related papers (2025-12-16T18:58:04Z)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners. We focus on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z)
- TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility [70.24211591214528]
Video generative models produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing. Existing Video-Language Models (VLMs) struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. We introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding. We propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding.
arXiv Detail & Related papers (2025-10-08T21:03:46Z)
- Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection [2.1013864820763755]
Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. Phys-AD is the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. The dataset includes more than 6,400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies.
arXiv Detail & Related papers (2025-03-05T14:49:08Z)
- Position: Stop Making Unscientific AGI Performance Claims [6.343515088115924]
Developments in the field of Artificial Intelligence (AI) have created a 'perfect storm' for observing 'sparks' of Artificial General Intelligence (AGI).
We argue and empirically demonstrate that the finding of meaningful patterns in latent spaces of models cannot be seen as evidence in favor of AGI.
We conclude that both the methodological setup and common public image of AI are ideal for the misinterpretation that correlations between model representations and some variables of interest are 'caused' by the model's understanding of underlying 'ground truth' relationships.
arXiv Detail & Related papers (2024-02-06T12:42:21Z)
- Learning Physical Dynamics with Subequivariant Graph Neural Networks [99.41677381754678]
Graph Neural Networks (GNNs) have become a prevailing tool for learning physical dynamics.
Physical laws abide by symmetry, which is a vital inductive bias accounting for model generalization.
Our model achieves on average over 3% enhancement in contact prediction accuracy across 8 scenarios on Physion and 2X lower rollout MSE on RigidFall.
arXiv Detail & Related papers (2022-10-13T10:00:30Z)
- Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.