NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
- URL: http://arxiv.org/abs/2510.03895v1
- Date: Sat, 04 Oct 2025 18:26:55 GMT
- Title: NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
- Authors: Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen,
- Abstract summary: Vision-Language-Action (VLA) models confront critical barriers to real-world deployment, most notably catastrophic forgetting.<n>We propose the Narrowing of Trajectory VLA framework: a novel approach that narrows its focus to sparse trajectories.<n>NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints.
- Score: 54.87964060934928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector's trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages with better performance in zero-shot. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments.<n>We learn a unified multimodal controller that supports both dense references and sparse task specifications.<n>Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z) - Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons [69.87766750714945]
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations.<n>We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision.<n>Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints.
arXiv Detail & Related papers (2026-03-02T17:38:58Z) - Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieve self-improvement by intrinsically guiding action refinement through sparse imagination.<n>SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models [51.43746425777865]
Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks.<n>We propose PILOT, a framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance.
arXiv Detail & Related papers (2026-01-07T12:38:56Z) - TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking [30.955088934475928]
We present TrackVLA++, a novel model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and atemporal Identification Memory (TIM)<n>TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings.
arXiv Detail & Related papers (2025-10-08T15:29:17Z) - Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert [60.88976842557026]
Vision-Language Models (VLM) have demonstrated impressive planning and reasoning capabilities.<n>Recent dual-system approaches attempt to decouple "thinking" from "acting"<n>We introduce a framework centered around a generalizable action expert.
arXiv Detail & Related papers (2025-10-04T18:33:27Z) - Efficient Virtuoso: A Latent Diffusion Transformer Model for Goal-Conditioned Trajectory Planning [0.0]
We present the Efficient Virtuoso, a conditional latent diffusion model for goal-conditioned trajectory planning.<n>We demonstrate that our method achieves state-of-the-art performance on the Open Motion dataset, achieving a minimum Average Displacement Error (minADE) of 0.25.<n>We provide a key insight: while a single goal can resolve strategic ambiguity, a richer, multi-step sparse route is essential for enabling the precise, high-fidelity tactical execution that mirrors nuanced human driving behavior.
arXiv Detail & Related papers (2025-09-03T19:18:02Z) - Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce textbfAlign-Then-stEer (textttATE), a novel, data-efficient, and plug-and-play adaptation framework.<n>Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z) - Action-Constrained Imitation Learning [12.316546911223263]
Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications.<n>In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with larger action space.<n>We tackle this mismatch through textittrajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints.
arXiv Detail & Related papers (2025-08-20T03:19:07Z) - Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics [68.36528819227641]
This paper systematically evaluates the robustness of Vision-Language-Action (VLA) models.<n>We introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory.<n>We design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments.
arXiv Detail & Related papers (2024-11-18T01:52:20Z) - Latent Weight Diffusion: Generating reactive policies instead of trajectories [12.270795590154489]
We propose Latent Weight Diffusion to generate closed-loop policies for robotic tasks.<n>LWD has higher success rates than Diffusion Policy when the action horizon is longer.<n>LWD achieves multitask performance comparable to DP while requiring just 1/45th of the inference-time FLOPS.
arXiv Detail & Related papers (2024-10-17T21:30:29Z) - Integrating DeepRL with Robust Low-Level Control in Robotic Manipulators for Non-Repetitive Reaching Tasks [0.24578723416255746]
In robotics, contemporary strategies are learning-based, characterized by a complex black-box nature and a lack of interpretability.
We propose integrating a collision-free trajectory planner based on deep reinforcement learning (DRL) with a novel auto-tuning low-level control strategy.
arXiv Detail & Related papers (2024-02-04T15:54:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.