PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
- URL: http://arxiv.org/abs/2511.01571v1
- Date: Mon, 03 Nov 2025 13:39:37 GMT
- Title: PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
- Authors: Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai, Ivan Laptev, Salman Khan, Yang Cong
- Abstract summary: Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. We introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder.
- Score: 59.32370587806426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.
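The abstract does not spell out the architecture, so the following is a minimal, hypothetical PyTorch sketch of the described design: a multiscale pixel-aware encoder and a visual prompting encoder whose tokens are fused with text tokens before an action head. Every module name, size, and fusion choice here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a PixelVLA-style model; not the authors' code.
import torch
import torch.nn as nn


class MultiscalePixelEncoder(nn.Module):
    """Extracts image features at several spatial scales and flattens them into tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3, dim, kernel_size=7, stride=s, padding=3)
            for s in (4, 8, 16)                            # three scales of the same image
        ])
        self.proj = nn.Linear(dim, dim)

    def forward(self, image):                              # image: (B, 3, H, W)
        tokens = []
        for stage in self.stages:
            feat = stage(image)                            # (B, dim, h, w)
            tokens.append(feat.flatten(2).transpose(1, 2)) # (B, h*w, dim)
        return self.proj(torch.cat(tokens, dim=1))         # multiscale pixel tokens


class VisualPromptEncoder(nn.Module):
    """Embeds visual prompts (here: normalized 2D points) as additional tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(2, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, points):                             # points: (B, P, 2) in [0, 1]
        return self.point_mlp(points)


class ToyPixelVLA(nn.Module):
    """Fuses pixel, visual-prompt, and text tokens, then decodes a discretized action."""
    def __init__(self, dim=256, vocab=512, action_dims=7, action_bins=256):
        super().__init__()
        self.pixel_enc = MultiscalePixelEncoder(dim)
        self.prompt_enc = VisualPromptEncoder(dim)
        self.text_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, action_dims * action_bins)
        self.action_dims, self.action_bins = action_dims, action_bins

    def forward(self, image, points, text_ids):
        tokens = torch.cat(
            [self.pixel_enc(image), self.prompt_enc(points), self.text_emb(text_ids)], dim=1)
        fused = self.backbone(tokens).mean(dim=1)          # pooled multimodal state
        return self.action_head(fused).view(-1, self.action_dims, self.action_bins)


model = ToyPixelVLA()
logits = model(torch.randn(2, 3, 64, 64),                  # image observation
               torch.rand(2, 4, 2),                        # four point prompts
               torch.randint(0, 512, (2, 16)))             # tokenized instruction
print(logits.shape)                                        # torch.Size([2, 7, 256])
```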
Related papers
- X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model [62.21943953611646]
Vision-Language-Action models rely on effective training across diverse robotic platforms. We propose a novel Soft Prompt approach with minimally added parameters. We show that our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks.
arXiv Detail & Related papers (2025-10-11T16:20:17Z)
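A rough sketch of the soft-prompt idea described above, under the assumption that each embodiment gets a small bank of learnable prompt tokens while the transformer backbone stays shared; names and sizes are illustrative, not taken from X-VLA.

```python
# Hypothetical soft-prompted cross-embodiment policy; not the X-VLA code.
import torch
import torch.nn as nn


class SoftPromptedPolicy(nn.Module):
    def __init__(self, dim=256, prompt_len=8, num_embodiments=4, action_dim=7):
        super().__init__()
        # one small bank of prompt tokens per embodiment (the only embodiment-specific weights)
        self.prompts = nn.Parameter(torch.randn(num_embodiments, prompt_len, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # shared across embodiments
        self.head = nn.Linear(dim, action_dim)

    def forward(self, obs_tokens, embodiment_id):                    # obs_tokens: (B, T, dim)
        prompt = self.prompts[embodiment_id].expand(obs_tokens.size(0), -1, -1)
        x = torch.cat([prompt, obs_tokens], dim=1)
        return self.head(self.backbone(x)[:, 0])                     # action from first token


policy = SoftPromptedPolicy()
actions = policy(torch.randn(2, 32, 256), embodiment_id=1)
print(actions.shape)                                                 # torch.Size([2, 7])
```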
- Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models [8.452688845632995]
We propose Oat-VLA, an Object-Agent-centric Tokenization for Vision-Language-Action (VLA) models. We find that Oat-VLA can drastically reduce the number of visual tokens to just a few tokens without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite, and outperforms OpenVLA in diverse real-world pick-and-place tasks.
arXiv Detail & Related papers (2025-09-28T05:42:53Z)
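The abstract only states that visual tokens are reduced to a handful of object- and agent-centric tokens; one plausible reading, sketched below with hypothetical shapes, is mask-weighted pooling of patch features into one token per region.

```python
# Hypothetical object/agent-centric token reduction; not the Oat-VLA implementation.
import torch


def object_agent_tokens(patch_feats, masks):
    """patch_feats: (B, N, D) dense patch features; masks: (B, K, N) soft region masks
    (e.g. one per detected object plus one for the robot gripper).
    Returns (B, K, D): one visual token per region instead of N patch tokens."""
    weights = masks / masks.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return torch.bmm(weights, patch_feats)


feats = torch.randn(2, 196, 256)                 # 14x14 patch grid
masks = torch.rand(2, 5, 196)                    # 4 object regions + 1 agent region
tokens = object_agent_tokens(feats, masks)
print(tokens.shape)                              # torch.Size([2, 5, 256])
```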
- Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding [65.11838260342586]
We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. We propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs. We also introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability.
arXiv Detail & Related papers (2025-04-14T17:52:22Z)
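A minimal sketch of what a visual prompt injection strategy could look like, assuming the prompt is a binary patch mask and a learned embedding is added to the prompted patch tokens; this is an illustration, not the Pixel-SAIL mechanism.

```python
# Hypothetical visual prompt injection; not the Pixel-SAIL code.
import torch
import torch.nn as nn


class PromptInjector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.prompt_embed = nn.Parameter(torch.randn(dim) * 0.02)  # added only at prompted patches

    def forward(self, patch_tokens, prompt_mask):
        # patch_tokens: (B, N, D); prompt_mask: (B, N) with 1 where the visual prompt falls
        return patch_tokens + prompt_mask.unsqueeze(-1) * self.prompt_embed


injector = PromptInjector()
tokens = injector(torch.randn(2, 196, 256), torch.randint(0, 2, (2, 196)).float())
print(tokens.shape)                                                 # torch.Size([2, 196, 256])
```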
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
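A toy sketch of visual chain-of-thought decoding as described above: a single autoregressive decoder that first emits a block of visual (subgoal) tokens and only then the action tokens. Token counts, model sizes, and the greedy loop are assumptions for illustration, not the CoT-VLA design.

```python
# Hypothetical visual chain-of-thought decoding; not the CoT-VLA implementation.
import torch
import torch.nn as nn


class VisualCoTDecoder(nn.Module):
    def __init__(self, vocab=1024, dim=256, n_visual=16, n_action=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)
        self.n_visual, self.n_action = n_visual, n_action

    @torch.no_grad()
    def generate(self, prefix_ids):                        # prefix_ids: (B, T) prompt tokens
        ids = prefix_ids
        for _ in range(self.n_visual + self.n_action):     # subgoal tokens first, then actions
            causal = torch.full((ids.size(1), ids.size(1)), float("-inf")).triu(1)
            h = self.body(self.embed(ids), mask=causal)
            next_id = self.lm_head(h[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        visual = ids[:, prefix_ids.size(1):prefix_ids.size(1) + self.n_visual]
        actions = ids[:, -self.n_action:]
        return visual, actions


dec = VisualCoTDecoder()
subgoal, act = dec.generate(torch.randint(0, 1024, (1, 12)))
print(subgoal.shape, act.shape)                            # torch.Size([1, 16]) torch.Size([1, 7])
```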
- Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding [0.0]
We propose a mask-text instruction tuning method called Aquila-plus to achieve pixel-level visual understanding.
Aquila-plus uses a convolutional CLIP as the visual encoder and employs a mask-aware visual extractor to extract precise visual mask features.
Experimental results demonstrate that Aquila-plus outperforms existing methods in various region understanding tasks.
arXiv Detail & Related papers (2024-11-09T10:42:57Z)
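A small sketch of a mask-aware visual extractor in the spirit described above: pool convolutional CLIP-style feature maps inside a provided region mask to obtain one region token. The shapes and pooling rule are assumptions, not Aquila-plus internals.

```python
# Hypothetical mask-aware feature pooling; not the Aquila-plus code.
import torch
import torch.nn.functional as F


def mask_pool(feature_map, mask):
    """feature_map: (B, C, H, W) visual features; mask: (B, 1, Hm, Wm) binary region mask.
    Returns (B, C): the average feature inside the (resized) mask."""
    mask = F.interpolate(mask.float(), size=feature_map.shape[-2:], mode="nearest")
    masked = feature_map * mask
    return masked.sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)


feats = torch.randn(2, 512, 24, 24)              # e.g. features from a convolutional CLIP backbone
mask = (torch.rand(2, 1, 336, 336) > 0.5)        # a segmentation-mask prompt
region_token = mask_pool(feats, mask)
print(region_token.shape)                        # torch.Size([2, 512])
```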
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data. We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
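For reference, a hedged usage sketch of loading the released OpenVLA checkpoint from the Hugging Face Hub, following the project's published example; identifiers such as the model id, predict_action, and unnorm_key may change, so consult the official repository.

```python
# Illustrative OpenVLA inference, adapted from the project's published usage; verify against
# the official repository before relying on these exact identifiers.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("observation.png")                    # current camera frame
prompt = "In: What action should the robot take to pick up the red block?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

# returns a 7-DoF end-effector action, un-normalized with per-dataset statistics
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```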
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
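A minimal sketch of one way such vision-language knowledge distillation could be set up: align pooled PLM text features with frozen CLIP text features through a projection head and a cosine distillation loss. The loss form and dimensions are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical feature-alignment distillation loss; not the VLKD training recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

clip_dim, plm_dim = 512, 768
proj = nn.Linear(plm_dim, clip_dim)                       # trainable alignment head


def vlkd_loss(plm_hidden, clip_text_feats):
    """plm_hidden: (B, plm_dim) pooled PLM features; clip_text_feats: (B, clip_dim) frozen CLIP
    text features for the same captions. Pulls the projected PLM features toward CLIP's space."""
    student = F.normalize(proj(plm_hidden), dim=-1)
    teacher = F.normalize(clip_text_feats, dim=-1)
    return 1.0 - (student * teacher).sum(dim=-1).mean()   # cosine-distance distillation


loss = vlkd_loss(torch.randn(8, plm_dim), torch.randn(8, clip_dim))
loss.backward()
print(float(loss))
```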