RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
- URL: http://arxiv.org/abs/2512.13660v1
- Date: Mon, 15 Dec 2025 18:52:43 GMT
- Title: RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
- Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
- Abstract summary: We propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring during supervised fine-tuning, then advances multi-step metric-grounded reasoning via reinforcement fine-tuning. We also present TraceSpatial-Bench, a challenging benchmark for evaluating spatial tracing.
- Score: 53.053660003572965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. Existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first acquires both 3D spatial referring and measuring abilities during supervised fine-tuning (SFT), using a universal spatial encoder and a regression-supervised decoder to enhance scale awareness. RoboTracer then advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues so that spatial traces are generated accurately. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
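The abstract names "metric-sensitive process rewards" but does not give their form. As a rough illustration only, the sketch below assumes the reward scores each intermediate metric estimate (e.g., a distance in meters) parsed from the model's reasoning chain, and blends that with an outcome reward on the final trace; the exponential kernel, `sigma`, the blending weight `w`, and all function names are assumptions, not the paper's formulation.

```python
import numpy as np

def metric_process_reward(pred_values, gt_values, sigma=0.05):
    """Hypothetical process reward: each intermediate metric estimate
    (e.g., an object distance in meters) parsed from the reasoning
    chain is scored by its error against ground truth."""
    if not gt_values:
        return 0.0
    return float(np.mean([np.exp(-abs(p - g) / sigma)
                          for p, g in zip(pred_values, gt_values)]))

def trace_reward(pred_trace, gt_trace):
    """Hypothetical outcome reward: mean L2 error between predicted and
    ground-truth 3D waypoints (assumed resampled to equal length),
    mapped into (0, 1]."""
    err = np.linalg.norm(np.asarray(pred_trace) - np.asarray(gt_trace),
                         axis=-1).mean()
    return 1.0 / (1.0 + err)

def total_reward(pred_values, gt_values, pred_trace, gt_trace, w=0.5):
    """Blend process and outcome signals; `w` is an assumed weight."""
    return (w * metric_process_reward(pred_values, gt_values)
            + (1 - w) * trace_reward(pred_trace, gt_trace))
```

In an RFT loop, a scalar like this would score sampled reasoning chains for a policy-gradient update.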
Related papers
- Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons [69.87766750714945]
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations. Robometer is a scalable reward-modeling framework that instead combines intra-trajectory progress supervision with inter-trajectory preference supervision. It is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints (see the sketch below).
arXiv Detail & Related papers (2026-03-02T17:38:58Z)
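The two loss terms are named but not specified in the abstract. One plausible instantiation, assuming per-frame MSE regression for progress and a Bradley-Terry-style pairwise term for trajectory preferences (both choices and `beta` are guesses), could look like this:

```python
import torch.nn.functional as F

def dual_objective(pred_progress, gt_progress,
                   score_preferred, score_dispreferred, beta=1.0):
    """Hypothetical rendering of a dual-objective reward-model loss.
    pred_progress / gt_progress: (B, T) per-frame progress in [0, 1]
    on expert demonstrations (frame-level progress loss).
    score_*: (B,) scalar scores for a preferred / dispreferred
    trajectory pair (trajectory-comparison preference loss)."""
    progress_loss = F.mse_loss(pred_progress, gt_progress)
    preference_loss = -F.logsigmoid(score_preferred - score_dispreferred).mean()
    return progress_loss + beta * preference_loss
```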
- SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation [63.48859753472547]
SpatialActor is a framework for robust robotic manipulation that explicitly decouples semantics and geometry. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions.
arXiv Detail & Related papers (2025-11-12T18:59:08Z)
- GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking [11.436294975354556]
GRASPTrack is a novel MOT framework that integrates monocular depth estimation and instance segmentation into a standard tracking-by-detection (TBD) pipeline. The recovered 3D point clouds are voxelized to enable a precise and robust voxel-based 3D Intersection-over-Union for association (see the sketch below).
arXiv Detail & Related papers (2025-08-11T15:56:21Z)
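The voxel-based 3D IoU itself is easy to sketch: back-projected object point clouds are quantized to an occupancy grid and compared set-wise. The 0.1 m voxel size below is an assumption, not a value from the paper.

```python
import numpy as np

def voxelize(points, voxel_size=0.1):
    """Quantize an (N, 3) point cloud to a set of occupied voxel indices."""
    return {tuple(v) for v in np.floor(points / voxel_size).astype(int)}

def voxel_iou(points_a, points_b, voxel_size=0.1):
    """Occupancy IoU between two object point clouds, in the spirit of
    GRASPTrack's voxel-based 3D Intersection-over-Union."""
    va, vb = voxelize(points_a, voxel_size), voxelize(points_b, voxel_size)
    union = len(va | vb)
    return len(va & vb) / union if union else 0.0
```

A `1 - voxel_iou` cost matrix between tracks and detections would then feed the usual TBD assignment step (e.g., Hungarian matching).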
- RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics [67.11221574129937]
Spatial referring is a fundamental capability that lets embodied robots interact with the 3D physical world. We propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding, then advances generalized multi-step spatial reasoning via reinforcement fine-tuning.
arXiv Detail & Related papers (2025-06-04T17:59:27Z)
- Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors [25.67875816218477]
Full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. Previous works either face the challenge of wearing additional sensors on the pelvis and lower body, or rely on external visual sensors to obtain the global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists.
arXiv Detail & Related papers (2025-05-08T15:28:09Z)
- ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer [12.58804521609764]
ODTFormer is a Transformer-based model that addresses both the obstacle detection and tracking problems.
We report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their cost.
arXiv Detail & Related papers (2024-03-21T17:59:55Z)
- An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds [50.19288542498838]
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving.
Current approaches all follow the Siamese paradigm based on appearance matching.
We introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective.
arXiv Detail & Related papers (2023-03-21T17:28:44Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a fully trainable neural message passing network for data association (a generic sketch follows this entry).
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
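For orientation, learned data association of this kind typically scores every track-detection pair after exchanging neighborhood information. The PyTorch sketch below is a generic rendering, not the paper's architecture: layer widths, mean aggregation, and the single propagation round are all assumptions.

```python
import torch
import torch.nn as nn

class MessagePassingAssociation(nn.Module):
    """Generic sketch of learned message passing for track-detection
    association; sizes and structure are illustrative assumptions."""

    def __init__(self, dim=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.affinity = nn.Linear(dim, 1)

    def forward(self, tracks, dets):
        # tracks: (T, dim) track embeddings; dets: (D, dim) detections.
        T, D = tracks.size(0), dets.size(0)
        pairs = torch.cat([tracks.unsqueeze(1).expand(T, D, -1),
                           dets.unsqueeze(0).expand(T, D, -1)], dim=-1)
        edges = self.edge_mlp(pairs)                      # (T, D, dim)
        # One round of message passing: each track aggregates its edges.
        msgs = edges.mean(dim=1)                          # (T, dim)
        tracks = self.node_mlp(torch.cat([tracks, msgs], dim=-1))
        edges = self.edge_mlp(
            torch.cat([tracks.unsqueeze(1).expand(T, D, -1),
                       dets.unsqueeze(0).expand(T, D, -1)], dim=-1))
        return self.affinity(edges).squeeze(-1)           # (T, D) logits
```

The (T, D) logits would then be thresholded or passed to a Hungarian solver to produce the final assignment.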