Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models
- URL: http://arxiv.org/abs/2512.13609v1
- Date: Mon, 15 Dec 2025 18:03:42 GMT
- Title: Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models
- Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
- Abstract summary: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models. Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world.
- Score: 57.71440995598757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
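The abstract mentions "a training strategy enforcing consistency for robust action grounding" without detailing it. The sketch below illustrates one plausible reading as a do-undo cycle-consistency objective; the `edit_model` interface, the L1 reconstruction terms, and the loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a do-undo cycle-consistency objective: apply an
# action edit, then its reverse, and penalise drift from the original image.
# The edit_model(image, instruction) -> edited_image interface is assumed.
import torch.nn.functional as F


def do_undo_consistency_loss(edit_model, image, action, inverse_action,
                             target_after_action=None, cycle_weight=1.0):
    """Forward-edit loss plus a do->undo reconstruction term."""
    # "Do": simulate the physical action (e.g. "pour the water out").
    done = edit_model(image, action)

    # Optional supervised term when a post-action frame is available,
    # e.g. from curated real-world video pairs.
    if target_after_action is not None:
        loss = F.l1_loss(done, target_after_action)
    else:
        loss = done.new_zeros(())

    # "Undo": apply the reverse instruction and require the scene to return
    # to its original state, enforcing physical reversibility.
    undone = edit_model(done, inverse_action)
    loss = loss + cycle_weight * F.l1_loss(undone, image)
    return loss
```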
Related papers
- Learning to Act Robustly with View-Invariant Latent Actions [8.446887947386559]
Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks.
arXiv Detail & Related papers (2026-01-06T13:14:01Z) - SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models [60.80050275581661]
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, yet they lack a grounded understanding of physical dynamics. We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks.
arXiv Detail & Related papers (2025-12-05T18:51:03Z) - PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction [52.44375492811009]
We present PhysHMR, a unified framework that learns a visual-to-action policy for humanoid control in a physics-based simulator. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
arXiv Detail & Related papers (2025-10-02T21:01:11Z) - SimGenHOI: Physically Realistic Whole-Body Humanoid-Object Interaction via Generative Modeling and Reinforcement Learning [6.255814224573073]
SimGenHOI is a unified framework that combines the strengths of generative modeling and reinforcement learning to produce controllable and physically plausible HOI. Our HOI generative model, based on Diffusion Transformers (DiT), predicts a set of key actions conditioned on text prompts, object geometry, sparse object waypoints, and the initial humanoid pose. To ensure physical realism, we design a contact-aware whole-body control policy trained with reinforcement learning, which tracks the generated motions while correcting artifacts such as penetration and foot sliding.
arXiv Detail & Related papers (2025-08-18T15:20:46Z) - Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning [16.193477346643295]
Scan, Materialize, Simulate (SMS) is a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.
arXiv Detail & Related papers (2025-05-20T21:55:01Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Learning the Effects of Physical Actions in a Multi-modal Environment [17.757831697284498]
Large Language Models (LLMs) handle physical commonsense information inadequately.
We introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs.
We show that multi-modal models can capture physical commonsense when augmented with visual information.
arXiv Detail & Related papers (2023-01-27T16:49:52Z) - ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
The benchmark contains over 17,000 action trajectories spanning six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z) - Model-Based Inverse Reinforcement Learning from Visual Demonstrations [20.23223474119314]
We present a gradient-based inverse reinforcement learning framework that learns cost functions when given only visual human demonstrations.
The learned cost functions are then used to reproduce the demonstrated behavior via visual model predictive control.
We evaluate our framework on hardware on two basic object manipulation tasks.
arXiv Detail & Related papers (2020-10-18T17:07:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.