PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
- URL: http://arxiv.org/abs/2512.03724v2
- Date: Mon, 08 Dec 2025 15:51:37 GMT
- Title: PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
- Authors: Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong
- Abstract summary: The PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
- Score: 92.85371254435074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
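The abstract names the pose-conditioned anchor attention mechanism but does not give its form, so the following is a minimal, hypothetical PyTorch sketch of one way it could work: a query derived from the target/end-effector pose attends over visual patch tokens, and an auxiliary KL term nudges the attention weights toward an anchor mask obtained by projecting that pose into the image. The module name, tensor shapes, and KL-based supervision are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseConditionedAnchorAttention(nn.Module):
    """Hypothetical sketch: a pose-derived query attends over visual tokens,
    with attention softly supervised toward a pose-derived anchor region."""

    def __init__(self, dim: int = 512, pose_dim: int = 7, n_heads: int = 8):
        super().__init__()
        self.pose_proj = nn.Sequential(
            nn.Linear(pose_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vis_tokens, pose, anchor_mask=None):
        # vis_tokens:  (B, N, dim) visual patch tokens from the vision backbone
        # pose:        (B, pose_dim), e.g. xyz + quaternion of the target pose
        # anchor_mask: (B, N) soft mask over patches near the projected pose
        query = self.pose_proj(pose).unsqueeze(1)                     # (B, 1, dim)
        anchored, attn_w = self.attn(query, vis_tokens, vis_tokens, need_weights=True)
        aux_loss = None
        if anchor_mask is not None:
            # Pose-conditioned supervision: pull attention mass into the anchor region.
            target = anchor_mask / anchor_mask.sum(-1, keepdim=True).clamp_min(1e-6)
            aux_loss = F.kl_div(attn_w.squeeze(1).clamp_min(1e-8).log(),
                                target, reduction="batchmean")
        return anchored.squeeze(1), aux_loss                          # (B, dim), scalar or None
```

In such a design the auxiliary term would be added to the action loss during training only, so inference would need no extra perception module, consistent with the lightweight, segmentation-free architecture the abstract claims.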
Related papers
- Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models [7.802379200026965]
We propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators.
arXiv Detail & Related papers (2026-03-05T13:14:41Z)
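As a toy illustration of the complexity-aware routing sketched in the entry above, the snippet below scores a latent embedding with one parametric estimator (Mahalanobis distance to the training feature distribution) and one non-parametric estimator (mean k-nearest-neighbor distance), then thresholds the combined score into act / think / abstain. The estimators, weights, and thresholds are assumptions for illustration, not the paper's actual ensemble.

```python
import numpy as np

def route(z, train_feats, mean, cov_inv, k=10, t_act=1.0, t_abstain=3.0):
    """Toy complexity-aware router over a latent embedding z (illustrative only)."""
    # Parametric estimator: Mahalanobis distance to the training feature distribution.
    d = z - mean
    maha = float(np.sqrt(d @ cov_inv @ d))
    # Non-parametric estimator: mean distance to the k nearest training features.
    knn = float(np.sort(np.linalg.norm(train_feats - z, axis=1))[:k].mean())
    score = 0.5 * maha + 0.5 * knn          # simple two-estimator ensemble
    if score < t_act:
        return "act"       # familiar state: execute the fast VLA policy directly
    if score < t_abstain:
        return "think"     # moderately novel: fall back to slower deliberate reasoning
    return "abstain"       # far out of distribution: defer or request help
```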
- From Perception to Action: An Interactive Benchmark for Vision Reasoning [51.11355591375073]
The Causal Hierarchy of Actions and Interactions (CHAIN) benchmark is designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and to robustly translate perceived structure into effective actions.
arXiv Detail & Related papers (2026-02-24T15:33:02Z)
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z)
- NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation [50.027425808733994]
NaVIDA is a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. NaVIDA augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. Experiments show that NaVIDA achieves superior navigation performance compared to state-of-the-art methods with fewer parameters.
arXiv Detail & Related papers (2026-01-26T06:16:17Z)
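The chunk-based inverse-dynamics supervision described for NaVIDA above can be pictured as an auxiliary head that recovers the action chunk connecting two observations. The generic sketch below is an assumption about how such a head might look; the architecture, chunk length, and loss are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsHead(nn.Module):
    """Hypothetical auxiliary head: predict the action chunk linking two visual features."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 4, chunk: int = 8):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, chunk * action_dim),
        )

    def forward(self, feat_t, feat_t_plus_k):
        # feat_t, feat_t_plus_k: (B, feat_dim) visual features before/after the chunk
        pred = self.mlp(torch.cat([feat_t, feat_t_plus_k], dim=-1))
        return pred.view(-1, self.chunk, self.action_dim)             # (B, chunk, action_dim)

# Assumed auxiliary supervision during training:
#   loss_idm = F.mse_loss(head(feat_t, feat_t_plus_k), action_chunk_gt)
```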
- Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA [21.362682837521632]
Latent Action Models (LAMs) enable Vision-Language-Action systems to learn semantic action representations from large-scale unannotated data. We propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module.
arXiv Detail & Related papers (2025-09-30T13:41:43Z)
- PhysiAgent: An Embodied Agent Framework in Physical World [33.821400205384144]
Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalization. Current approaches often combine these models in rigid, sequential structures. We propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments.
arXiv Detail & Related papers (2025-09-29T09:39:32Z)
- Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy [47.51062818231493]
We introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system. This strategy substantially improves model resilience to camera viewpoint variations.
arXiv Detail & Related papers (2025-08-18T17:10:45Z)
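The core operation in the OC-VLA entry above, re-expressing the end-effector pose in the camera frame, is standard rigid-transform algebra. The sketch below shows that composition with homogeneous matrices; the variable names are illustrative, not the paper's API.

```python
import numpy as np

def pose_to_matrix(position, rotation):
    """Build a 4x4 homogeneous transform from a 3-vector translation and a 3x3 rotation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

def end_effector_in_camera(T_base_cam, T_base_ee):
    # T_base_cam: camera pose in the robot base frame (extrinsics)
    # T_base_ee:  end-effector pose in the robot base frame
    # The same end-effector pose expressed in the camera frame:
    #   T_cam_ee = inv(T_base_cam) @ T_base_ee
    return np.linalg.inv(T_base_cam) @ T_base_ee
```

Predicting actions against a camera-frame pose like `T_cam_ee` keeps the targets consistent with the pixels when the camera moves, which is one intuition for the viewpoint resilience the entry reports.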
- SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z)
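Token pruning, one half of the acceleration recipe summarized above, can be illustrated by keeping only the top-scoring visual tokens. The scoring signal and keep ratio below are placeholders rather than SP-VLA's actual criterion, and the model-scheduling half of the method is omitted.

```python
import torch

def prune_tokens(vis_tokens, scores, keep_ratio: float = 0.25):
    """Keep the top-k visual tokens by relevance score (illustrative pruning only).

    vis_tokens: (B, N, D) visual tokens; scores: (B, N), e.g. attention received
    from the language/query tokens."""
    B, N, D = vis_tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices                               # (B, k)
    return vis_tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))    # (B, k, D)
```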
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
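The Dita entry above describes denoising continuous action sequences with a Transformer. To make that concrete, the snippet below is a generic DDIM-style reverse step over a noisy action chunk; the noise schedule, shapes, and the `model` call are placeholders, not Dita's actual interface.

```python
import torch

@torch.no_grad()
def ddim_action_step(model, a_t, t, t_prev, obs_tokens, alphas_cumprod):
    """One generic deterministic denoising step for an action chunk a_t: (B, T, act_dim).

    `model` stands in for a noise-prediction Transformer conditioned on observation tokens."""
    eps = model(a_t, t, obs_tokens)                       # predicted noise, same shape as a_t
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    # Estimate the clean action chunk, then move it to the previous noise level.
    a0_hat = (a_t - (1.0 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
    return a_bar_prev.sqrt() * a0_hat + (1.0 - a_bar_prev).sqrt() * eps
```

Iterating such a step from pure noise down to t = 0 would yield the continuous action sequence that the policy executes.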
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)