GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
- URL: http://arxiv.org/abs/2512.16811v1
- Date: Thu, 18 Dec 2025 17:51:42 GMT
- Title: GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
- Authors: Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang
- Abstract summary: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines.
- Score: 26.632472450402947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
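The abstract states that GeoPredict's predictive modules are supervised only at training time through depth-based rendering. As a rough illustration of that idea, the sketch below renders a depth map from predicted 3D points with a toy z-buffer splat renderer and takes an L1 loss against ground-truth depth; the paper's actual Gaussian rasterizer and module wiring are not shown, and `render_depth` / `depth_supervision_loss` are hypothetical names for this sketch.

```python
import numpy as np

def render_depth(points_cam, K, hw):
    """Z-buffer depth map from 3D points in the camera frame (toy splat renderer)."""
    h, w = hw
    depth = np.full((h, w), np.inf)
    z = points_cam[:, 2]
    valid = z > 1e-6
    uvw = (K @ points_cam[valid].T).T              # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    zv = z[valid]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], zv[inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)     # keep the nearest surface
    return depth

def depth_supervision_loss(pred_points, gt_depth, K):
    """L1 loss between rendered and ground-truth depth on covered pixels."""
    rendered = render_depth(pred_points, K, gt_depth.shape)
    mask = np.isfinite(rendered) & (gt_depth > 0)
    return float(np.abs(rendered[mask] - gt_depth[mask]).mean())
```

Because this supervision signal is computed only from rendered depth, it can be dropped entirely at inference time, which matches the paper's claim that deployment needs no 3D decoding.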
Related papers
- Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation [53.09168514034483]
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. We propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap.
arXiv Detail & Related papers (2026-02-27T08:54:20Z) - StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation [6.0744834626758495]
StemVLA is a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D representations into action prediction. We show that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], with an average sequence length of XXX.
arXiv Detail & Related papers (2026-02-27T06:43:37Z) - Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction [10.394184895110007]
We present GPOcc, a framework that leverages visual geometry priors for monocular occupancy prediction. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains.
arXiv Detail & Related papers (2026-02-25T04:16:54Z) - Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision. Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
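Chamfer Distance, the metric on which SAGE reports its 20-42% reductions, measures the average nearest-neighbor distance between two point sets in both directions. A minimal brute-force version (fine for small clouds; real evaluations typically use a KD-tree or GPU kernel):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances, summed over both directions,
    which is the common convention in 3D reconstruction benchmarks.
    """
    # (N, M) pairwise squared distances via broadcasting
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Note that papers vary in convention (squared vs. unsquared distances, sum vs. mean over directions), so reported numbers are only comparable within one protocol.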
arXiv Detail & Related papers (2026-02-08T09:53:21Z) - Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models [79.18306680174011]
DSR Suite bridges gaps across dataset, benchmark, and model. We propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. The pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories.
arXiv Detail & Related papers (2025-12-23T17:56:36Z) - MP-GFormer: A 3D-Geometry-Aware Dynamic Graph Transformer Approach for Machining Process Planning [0.43553942673960666]
We propose MP-GFormer, a 3D-geometry-aware dynamic graph transformer that integrates evolving 3D geometric representations into dynamic graph learning (DGL) to predict machining operation sequences. Our approach leverages StereoLithography surface meshes representing the 3D geometry of a part after each machining operation, along with the boundary representation method used for the initial 3D designs.
arXiv Detail & Related papers (2025-11-14T19:58:39Z) - PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting [56.188624157291024]
We introduce PLANA3R, a pose-free framework for metric planar 3D reconstruction from unposed two-view images. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments.
arXiv Detail & Related papers (2025-10-21T15:15:33Z) - GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation [57.8059956428009]
Recent attempts to transfer features from 2D Vision-Language Models to 3D semantic segmentation expose a persistent trade-off. We propose GeoPurify, which applies a small Student Affinity Network to 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency.
arXiv Detail & Related papers (2025-10-02T16:37:56Z) - TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking [25.788917457593673]
TrackAny3D is the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. The MoGE architecture adaptively activates specialized networks based on distinct geometric characteristics. Experiments show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT.
arXiv Detail & Related papers (2025-07-26T10:41:55Z) - E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z) - GaussRender: Learning 3D Occupancy with Gaussian Rendering [86.89653628311565]
GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-07T16:07:51Z)
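The projective-consistency idea behind GaussRender can be illustrated with a deliberately simplified stand-in: project a 3D occupancy grid to a 2D map and penalize disagreement with a ground-truth view. The orthographic max-projection and binary cross-entropy below are assumptions of this sketch, not the paper's actual Gaussian rasterization pipeline.

```python
import numpy as np

def projective_consistency_loss(occ_probs, gt_mask):
    """Penalize 3D occupancy whose 2D projection disagrees with a GT mask.

    occ_probs: (D, H, W) voxel occupancy probabilities in [0, 1].
    gt_mask:   (H, W) binary ground-truth occupancy seen from this view.
    Projection here is a simple max over the depth axis (orthographic),
    standing in for a proper differentiable Gaussian renderer.
    """
    proj = occ_probs.max(axis=0)                  # 2D "rendered" occupancy
    eps = 1e-7
    proj = np.clip(proj, eps, 1.0 - eps)
    # binary cross-entropy between the projection and the ground-truth view
    bce = -(gt_mask * np.log(proj) + (1.0 - gt_mask) * np.log(1.0 - proj))
    return float(bce.mean())
```

The key property the loss captures is that a 3D configuration can only score well if every supervised 2D view of it is consistent, which is what pushes the volume toward a coherent structure.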
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.