Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling
- URL: http://arxiv.org/abs/2410.18912v1
- Date: Thu, 24 Oct 2024 17:02:52 GMT
- Title: Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling
- Authors: Mingtong Zhang, Kaifeng Zhang, Yunzhu Li
- Abstract summary: We introduce a framework to learn object dynamics directly from multi-view RGB videos.
We train a particle-based dynamics model using Graph Neural Networks.
Our method can predict object motions under varying initial configurations and unseen robot actions.
- Score: 10.247075501610492
- Abstract: Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.
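The abstract describes a pipeline in which a GNN dynamics model operates on sparse control particles downsampled from densely tracked 3D Gaussians, and per-Gaussian motion is then interpolated back from the predicted control-particle motions. The sketch below is a minimal, untrained illustration of that data flow, not the authors' implementation: the farthest-point sampling, kNN graph, single message-passing step, and inverse-distance (LBS-style) blending of translations are assumptions made for illustration, and every function name is hypothetical.

```python
# Minimal sketch (NOT the authors' code): a toy GNN dynamics step on sparse
# control particles downsampled from tracked 3D Gaussian centers, followed by
# interpolation of per-Gaussian motion from the control particles.
import numpy as np

def farthest_point_sample(points, k):
    """Downsample dense Gaussian centers (N, 3) to k control particle indices."""
    idx = [0]
    dist = np.full(len(points), np.inf)
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))
    return np.array(idx)

def knn_edges(points, k=5):
    """Build a kNN graph over control particles; returns (E, 2) index pairs."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]          # skip self (distance 0)
    src = np.repeat(np.arange(len(points)), k)
    return np.stack([src, nbrs.ravel()], axis=1)

def gnn_step(x, edges, action, W_msg, W_upd):
    """One message-passing step: aggregate relative-position messages plus the
    robot action, then predict a per-particle displacement."""
    rel = x[edges[:, 1]] - x[edges[:, 0]]             # relative positions on edges
    msg = np.zeros_like(x)
    np.add.at(msg, edges[:, 0], np.tanh(rel @ W_msg)) # sum messages per node
    feat = np.concatenate([x, msg, np.tile(action, (len(x), 1))], axis=1)
    return x + np.tanh(feat @ W_upd)                  # predicted next positions

def interpolate_gaussians(gaussians, ctrl_before, ctrl_after, k=4):
    """Blend control-particle displacements onto every Gaussian center with
    inverse-distance weights (an LBS-style assumption, translations only)."""
    d = np.linalg.norm(gaussians[:, None] - ctrl_before[None, :], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(d, nbrs, axis=1) + 1e-6)
    w /= w.sum(axis=1, keepdims=True)
    disp = ctrl_after - ctrl_before
    return gaussians + (w[..., None] * disp[nbrs]).sum(axis=1)

# Toy rollout with untrained (random) weights, just to show the data flow.
rng = np.random.default_rng(0)
gaussians = rng.uniform(size=(2000, 3))               # tracked Gaussian centers
ctrl = gaussians[farthest_point_sample(gaussians, 50)]
edges = knn_edges(ctrl)
action = rng.uniform(size=3)                          # e.g. end-effector delta
W_msg = rng.normal(scale=0.1, size=(3, 3))
W_upd = rng.normal(scale=0.1, size=(9, 3))
ctrl_next = gnn_step(ctrl, edges, action, W_msg, W_upd)
gaussians_next = interpolate_gaussians(gaussians, ctrl, ctrl_next)
print(gaussians_next.shape)                           # (2000, 3)
```

In the paper's actual pipeline the dynamics model is trained on offline robot interaction data, and the interpolation transfers full 3D transformations (not just translations) to the Gaussians so that predicted future states can be rendered with 3DGS for action-conditioned video prediction.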
Related papers
- Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
arXiv Detail & Related papers (2024-09-26T17:57:16Z) - Learning 3D Particle-based Simulators from RGB-D Videos [15.683877597215494]
We propose a method for learning simulators directly from observations.
Visual Particle Dynamics (VPD) jointly learns a latent particle-based representation of 3D scenes.
Unlike existing 2D video prediction models, VPD's 3D structure enables scene editing and long-term predictions.
arXiv Detail & Related papers (2023-12-08T20:45:34Z) - DeformGS: Scene Flow in Highly Deformable Scenes for Deformable Object Manipulation [66.7719069053058]
DeformGS is an approach to recover scene flow in highly deformable scenes using simultaneous video captures of a dynamic scene from multiple cameras.
DeformGS improves 3D tracking by an average of 55.8% compared to the state-of-the-art.
With sufficient texture, DeformGS achieves a median tracking error of 3.3 mm on a cloth of 1.5 x 1.5 m in area.
arXiv Detail & Related papers (2023-11-30T18:53:03Z) - 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z) - T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and learns from observational cues in the image and point cloud domains.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z) - Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
arXiv Detail & Related papers (2022-02-24T01:31:29Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators [24.181604511269096]
We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space.
In this space, objects do not interfere with one another and their appearance persists over time and across viewpoints.
We show that our model generalizes its predictions well across varying numbers and appearances of interacting objects, as well as across camera viewpoints.
arXiv Detail & Related papers (2020-11-12T16:15:52Z) - Learning 3D Dynamic Scene Representations for Robot Manipulation [21.6131570689398]
A 3D scene representation for robot manipulation should capture three key object properties: permanency, completeness, and continuity.
We introduce 3D Dynamic Scene Representation (DSR), a 3D scene representation that simultaneously discovers, tracks, and reconstructs objects and predicts their dynamics.
We propose DSR-Net, which learns to aggregate visual observations over multiple interactions to gradually build and refine DSR.
arXiv Detail & Related papers (2020-11-03T19:23:06Z)