GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
- URL: http://arxiv.org/abs/2512.16811v1
- Date: Thu, 18 Dec 2025 17:51:42 GMT
- Title: GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
- Authors: Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang
- Abstract summary: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines.
- Score: 26.632472450402947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
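The abstract states that GeoPredict's predictive modules are supervised only at training time through depth-based rendering. As a rough illustration of that idea, the sketch below renders a depth map from predicted 3D points with a toy z-buffer splat renderer and takes an L1 loss against ground-truth depth; the paper's actual Gaussian rasterizer and module wiring are not shown, and `render_depth` / `depth_supervision_loss` are hypothetical names for this sketch.

```python
import numpy as np

def render_depth(points_cam, K, hw):
    """Z-buffer depth map from 3D points in the camera frame (toy splat renderer)."""
    h, w = hw
    depth = np.full((h, w), np.inf)
    z = points_cam[:, 2]
    valid = z > 1e-6
    uvw = (K @ points_cam[valid].T).T              # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    zv = z[valid]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], zv[inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)     # keep the nearest surface
    return depth

def depth_supervision_loss(pred_points, gt_depth, K):
    """L1 loss between rendered and ground-truth depth on covered pixels."""
    rendered = render_depth(pred_points, K, gt_depth.shape)
    mask = np.isfinite(rendered) & (gt_depth > 0)
    return float(np.abs(rendered[mask] - gt_depth[mask]).mean())
```

Because this supervision signal is computed only from rendered depth, it can be dropped entirely at inference time, which matches the paper's claim that deployment needs no 3D decoding.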
Related papers
- Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation [53.09168514034483]
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. We propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap.
arXiv Detail & Related papers (2026-02-27T08:54:20Z) - StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation [6.0744834626758495]
StemVLA is a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D representations into action prediction. We show that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], with an average sequence length of XXX.
arXiv Detail & Related papers (2026-02-27T06:43:37Z) - Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction [10.394184895110007]
We present GPOcc, a framework that leverages visual geometry priors for monocular occupancy prediction. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains.
arXiv Detail & Related papers (2026-02-25T04:16:54Z) - Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision. Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
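Chamfer Distance, the metric on which SAGE reports its 20-42% reductions, measures the average nearest-neighbor distance between two point sets in both directions. A minimal brute-force version (fine for small clouds; real evaluations typically use a KD-tree or GPU kernel):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances, summed over both directions,
    which is the common convention in 3D reconstruction benchmarks.
    """
    # (N, M) pairwise squared distances via broadcasting
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Note that papers vary in convention (squared vs. unsquared distances, sum vs. mean over directions), so reported numbers are only comparable within one protocol.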
arXiv Detail & Related papers (2026-02-08T09:53:21Z) - Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models [79.18306680174011]
DSR Suite bridges gaps across dataset, benchmark, and model. We propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. The pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories.
arXiv Detail & Related papers (2025-12-23T17:56:36Z) - MP-GFormer: A 3D-Geometry-Aware Dynamic Graph Transformer Approach for Machining Process Planning [0.43553942673960666]
We propose MP-GFormer, a 3D-geometry-aware dynamic graph transformer that integrates evolving 3D geometric representations into dynamic graph learning (DGL) to predict machining operation sequences. Our approach leverages StereoLithography surface meshes representing the 3D geometry of a part after each machining operation, along with the boundary representation method used for the initial 3D designs.
arXiv Detail & Related papers (2025-11-14T19:58:39Z) - PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting [56.188624157291024]
We introduce PLANA3R, a pose-free framework for metric planar 3D reconstruction from unposed two-view images. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments.
arXiv Detail & Related papers (2025-10-21T15:15:33Z) - GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation [57.8059956428009]
Recent attempts to transfer features from 2D Vision-Language Models to 3D semantic segmentation expose a persistent trade-off. We propose GeoPurify, which applies a small Student Affinity Network to 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency.
arXiv Detail & Related papers (2025-10-02T16:37:56Z) - TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking [25.788917457593673]
TrackAny3D is the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. The MoGE architecture adaptively activates specialized networks based on distinct geometric characteristics. Experiments show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT.
arXiv Detail & Related papers (2025-07-26T10:41:55Z) - E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z) - GaussRender: Learning 3D Occupancy with Gaussian Rendering [86.89653628311565]
GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-07T16:07:51Z)
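The projective-consistency idea behind GaussRender can be illustrated with a deliberately simplified stand-in: project a 3D occupancy grid to a 2D map and penalize disagreement with a ground-truth view. The orthographic max-projection and binary cross-entropy below are assumptions of this sketch, not the paper's actual Gaussian rasterization pipeline.

```python
import numpy as np

def projective_consistency_loss(occ_probs, gt_mask):
    """Penalize 3D occupancy whose 2D projection disagrees with a GT mask.

    occ_probs: (D, H, W) voxel occupancy probabilities in [0, 1].
    gt_mask:   (H, W) binary ground-truth occupancy seen from this view.
    Projection here is a simple max over the depth axis (orthographic),
    standing in for a proper differentiable Gaussian renderer.
    """
    proj = occ_probs.max(axis=0)                  # 2D "rendered" occupancy
    eps = 1e-7
    proj = np.clip(proj, eps, 1.0 - eps)
    # binary cross-entropy between the projection and the ground-truth view
    bce = -(gt_mask * np.log(proj) + (1.0 - gt_mask) * np.log(1.0 - proj))
    return float(bce.mean())
```

The key property the loss captures is that a 3D configuration can only score well if every supervised 2D view of it is consistent, which is what pushes the volume toward a coherent structure.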
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.