TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
- URL: http://arxiv.org/abs/2510.07550v1
- Date: Wed, 08 Oct 2025 21:03:46 GMT
- Title: TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
- Authors: Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
- Abstract summary: Video generative models produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing. Existing Video-Language Models (VLMs) struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. We introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding. We also propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
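The abstract names the key components but not their internals, so the following are illustrative sketches rather than the paper's implementation. First, one plausible form of a trajectory-aware attention module: cross-attend the VLM's frame-patch tokens to tokens derived from tracked object trajectories, so that motion is encoded explicitly. The class name, the (x, y, t, visibility) trajectory encoding, and all dimensions below are assumptions for illustration only.

```python
# Hypothetical sketch of a trajectory-aware attention block; TRAVL's actual
# module may differ. Assumes trajectories from an off-the-shelf point tracker.
import torch
import torch.nn as nn

class TrajectoryAwareAttention(nn.Module):
    """Cross-attend visual tokens to trajectory tokens so the model encodes
    motion explicitly instead of relying on per-frame appearance alone."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.traj_proj = nn.Linear(4, dim)  # (x, y, t, visibility) -> dim
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, trajs: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, dim) patch features from the video encoder
        # trajs:      (B, K, 4) K tracked points per clip
        traj_tokens = self.traj_proj(trajs)                  # (B, K, dim)
        motion, _ = self.attn(vis_tokens, traj_tokens, traj_tokens)
        return self.norm(vis_tokens + motion)                # residual update

# Smoke test with random tensors.
x, t = torch.randn(2, 196, 768), torch.randn(2, 32, 4)
print(TrajectoryAwareAttention(768)(x, t).shape)  # torch.Size([2, 196, 768])
```

Second, the "stricter LLM-as-judge" metric: a minimal version credits a model's answer only if a judge model parses it as an explicit verdict. `ask_llm` below stands in for any prompt-to-text callable; the exact ImplausiBench judging prompt is not given in the abstract.

```python
# Hypothetical LLM-as-judge check; the real ImplausiBench protocol may differ.
def judge_says_implausible(model_answer: str, ask_llm) -> bool:
    prompt = (
        "A model was asked whether a video is physically plausible.\n"
        f"Model answer: {model_answer!r}\n"
        "Does the answer say the video violates physics? Reply YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```

Requiring an unambiguous YES is one reason such a metric can be stricter than lenient human grading: hedged or evasive answers are not credited.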
Related papers
- PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models
Physical principles are fundamental to realistic visual simulation, yet they remain largely overlooked in transformer-based video generation. We introduce a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces. We extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning.
arXiv Detail & Related papers (2026-01-16T08:40:10Z)
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, but they lack a grounded understanding of physical dynamics. We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks.
arXiv Detail & Related papers (2025-12-05T18:51:03Z)
- PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. We propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. We show that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks.
arXiv Detail & Related papers (2025-12-04T07:28:56Z)
- LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
We introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models, and use it to benchmark intuitive physics understanding in current models. Empirical results show that, although current models struggle with complex and chaotic dynamics, physics understanding improves clearly as model capacity and inference settings scale.
arXiv Detail & Related papers (2025-10-13T15:19:07Z)
- Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it.
arXiv Detail & Related papers (2025-09-29T12:32:54Z)
- Think Before You Diffuse: Infusing Physical Rules into Video Diffusion
The complexity of real-world motions, interactions, and dynamics introduces great difficulty when learning physics from data. We propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model.
arXiv Detail & Related papers (2025-05-27T18:26:43Z)
- PhyMAGIC: Physical Motion-Aware Generative Inference with Confidence-guided LLM
We present PhyMAGIC, a training-free framework that generates physically consistent motion from a single image. PhyMAGIC integrates a pre-trained image-to-video diffusion model, confidence-guided reasoning via LLMs, and a differentiable physics simulator. Comprehensive experiments demonstrate that PhyMAGIC outperforms state-of-the-art video generators and physics-aware baselines.
arXiv Detail & Related papers (2025-05-22T09:40:34Z)
- VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos. However, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics. We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics through a vision- and language-informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z)
- Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. In this paper, we introduce a novel test-time framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
arXiv Detail & Related papers (2025-02-23T20:42:15Z)
- Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). Fine-tuning is expensive for large models and impractical to perform repeatedly for every task. We introduce Physics Context Builders (PCBs), a novel modular framework where specialized VLMs are fine-tuned to generate detailed physical scene descriptions.
arXiv Detail & Related papers (2024-12-11T18:40:16Z)