Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
- URL: http://arxiv.org/abs/2602.19710v1
- Date: Mon, 23 Feb 2026 11:00:08 GMT
- Title: Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
- Authors: Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu
- Abstract summary: Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
- Score: 83.39008378156647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
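The abstract's "discrete pose tokens" idea can be illustrated with a minimal sketch: a continuous camera-centric 6-DoF pose is uniformly binned per dimension into integer token ids that any language-model-style vocabulary could absorb. The bin count and working ranges below are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

N_BINS = 256  # hypothetical per-dimension vocabulary size

# Assumed working ranges in metres / radians (illustrative values only).
POS_RANGE = (-1.0, 1.0)      # x, y, z in the camera frame
ROT_RANGE = (-np.pi, np.pi)  # roll, pitch, yaw

def pose_to_tokens(pose):
    """Map [x, y, z, roll, pitch, yaw] to six integer token ids in [0, N_BINS)."""
    pose = np.asarray(pose, dtype=np.float64)
    lows = np.array([POS_RANGE[0]] * 3 + [ROT_RANGE[0]] * 3)
    highs = np.array([POS_RANGE[1]] * 3 + [ROT_RANGE[1]] * 3)
    norm = (np.clip(pose, lows, highs) - lows) / (highs - lows)  # -> [0, 1]
    return np.minimum((norm * N_BINS).astype(int), N_BINS - 1)

def tokens_to_pose(tokens):
    """Invert the binning up to quantization error, using bin centres."""
    lows = np.array([POS_RANGE[0]] * 3 + [ROT_RANGE[0]] * 3)
    highs = np.array([POS_RANGE[1]] * 3 + [ROT_RANGE[1]] * 3)
    centres = (np.asarray(tokens) + 0.5) / N_BINS
    return lows + centres * (highs - lows)

tokens = pose_to_tokens([0.1, -0.2, 0.5, 0.0, 0.3, -1.0])
recon = tokens_to_pose(tokens)
```

Because every dataset's poses are quantized into the same shared vocabulary, spatial-grounding data and robot trajectories can, in principle, supervise the same token head.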
Related papers
- Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation [95.89924101984566]
We introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories. LCM injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory.
arXiv Detail & Related papers (2026-02-22T15:39:34Z)
- PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
The PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z)
- iFlyBot-VLA Technical Report [25.330744626382977]
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; and (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets.
arXiv Detail & Related papers (2025-11-01T06:24:56Z)
- Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert [60.88976842557026]
Vision-Language Models (VLMs) have demonstrated impressive planning and reasoning capabilities. Recent dual-system approaches attempt to decouple "thinking" from "acting". We introduce a framework centered around a generalizable action expert.
arXiv Detail & Related papers (2025-10-04T18:33:27Z)
- Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations [26.678553477485362]
We present a framework that better preserves pretrained features while adapting them for robot manipulation. Our approach introduces three components: (i) a dual-encoder design with one frozen vision encoder to retain pretrained features and another trainable encoder for task adaptation, (ii) a string-based action tokenizer that casts continuous actions into character sequences aligned with the model's pretraining domain, and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and affordances.
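The "string-based action tokenizer" mentioned in component (ii) can be sketched very simply: continuous actions are printed as fixed-precision character strings so that an unmodified language-model tokenizer can consume them. The exact format below (three decimal places, space-separated) is an assumption for illustration, not the paper's scheme.

```python
# Illustrative string-based action tokenizer: continuous actions become plain
# character sequences, and parsing them back recovers the values up to the
# chosen precision.

def action_to_string(action, decimals=3):
    """Render a continuous action vector as a space-separated character string."""
    return " ".join(f"{a:.{decimals}f}" for a in action)

def string_to_action(text):
    """Parse the character string back into a list of floats."""
    return [float(tok) for tok in text.split()]

s = action_to_string([0.12345, -0.5, 1.0])
a = string_to_action(s)
```

The appeal of such a scheme is that the action text lives in the same character distribution the VLM saw during pretraining, so no new embedding rows need to be trained from scratch.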
arXiv Detail & Related papers (2025-09-14T20:08:56Z)
- Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy [47.51062818231493]
We introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system. This strategy substantially improves model resilience to camera viewpoint variations.
arXiv Detail & Related papers (2025-08-18T17:10:45Z)
- TrackVLA: Embodied Visual Tracking in the Wild [34.03604806748204]
Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. Existing approaches typically address this challenge through a modular separation of recognition and planning. We propose TrackVLA, a Vision-Language-Action model that learns the synergy between object recognition and trajectory planning.
arXiv Detail & Related papers (2025-05-29T07:28:09Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- PointVLA: Injecting the 3D World into Vision-Language-Action Models [10.758939578236582]
We propose PointVLA, a framework that enhances pre-trained vision-language-action models with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. PointVLA outperforms state-of-the-art 2D imitation learning methods across both simulated and real-world robotic tasks.
arXiv Detail & Related papers (2025-03-10T16:32:41Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
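The denoising objective behind this family of policies can be sketched in a few lines: noise a clean action chunk at a random diffusion timestep, then train the model to regress the injected noise. The linear schedule and the dummy zero-predictor below are placeholders for illustration, not the paper's transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100  # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)          # toy linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)        # cumulative signal retention

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

actions = rng.normal(size=(16, 7))          # a chunk of 16 seven-DoF actions
t = int(rng.integers(T))                    # random diffusion timestep
eps = rng.normal(size=actions.shape)        # noise to be regressed
x_t = q_sample(actions, t, eps)

# A real model would predict eps from (x_t, t, observations); a dummy
# zero-predictor here just shows the MSE training loss the denoiser minimizes.
loss = np.mean((eps - np.zeros_like(eps)) ** 2)
```

At inference, the learned predictor is applied iteratively from pure noise down to a clean action sequence, conditioned on the current observations.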
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.