ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
- URL: http://arxiv.org/abs/2511.18082v1
- Date: Sat, 22 Nov 2025 14:44:03 GMT
- Title: ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
- Authors: Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang,
- Abstract summary: We present ActDistill, a framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. We employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup.
- Score: 14.202025149504715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.
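The abstract describes the student as executing only dynamically routed layers under action-guided, graph-informed supervision, but no implementation details are given here. The sketch below is a minimal, assumption-laden illustration of that routing-plus-distillation idea: the per-layer sigmoid gates, the pooled routing features, the MSE action/hidden losses, and the sparsity weight are hypothetical stand-ins, not ActDistill's actual graph-structured encapsulation or routing design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedStudent(nn.Module):
    """Student stack whose transformer layers are gated by lightweight per-layer routers."""

    def __init__(self, layers: nn.ModuleList, hidden_dim: int):
        super().__init__()
        self.layers = layers
        # One scalar gate per layer, computed from the pooled hidden state.
        self.routers = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in layers)

    def forward(self, h: torch.Tensor, hard: bool = False):
        # h: (batch, seq_len, hidden_dim)
        gate_scores = []
        for layer, router in zip(self.layers, self.routers):
            g = torch.sigmoid(router(h.mean(dim=1)))  # (batch, 1) gate per layer
            if hard:
                # Inference: skip the layer outright when the gate is closed.
                if g.mean().item() < 0.5:
                    gate_scores.append(g)
                    continue
                h = layer(h)
            else:
                # Training: soft interpolation keeps the router differentiable.
                h = g.unsqueeze(-1) * layer(h) + (1.0 - g.unsqueeze(-1)) * h
            gate_scores.append(g)
        return h, torch.cat(gate_scores, dim=-1)  # hidden states and (batch, n_layers) gates


def distillation_loss(student_hidden, student_gates, student_actions,
                      teacher_hidden, teacher_actions, sparsity_weight=0.01):
    """Action-guided distillation sketch: match the teacher's actions and hidden
    states while penalising the number of executed layers."""
    action_term = F.mse_loss(student_actions, teacher_actions)
    hidden_term = F.mse_loss(student_hidden, teacher_hidden)
    sparsity_term = student_gates.mean()  # pressure toward closed gates, i.e. fewer layers
    return action_term + hidden_term + sparsity_weight * sparsity_term
```

During training the soft gates keep gradients flowing to the router; at inference (`hard=True`) closed gates skip layers entirely, which is where the claimed computation and latency savings would come from.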
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - ActionCodec: What Makes for Good Action Tokenizers [106.78093973045526]
Vision-Language-Action (VLA) models have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity. We introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance.
arXiv Detail & Related papers (2026-02-17T07:07:15Z) - Learning Generalizable Visuomotor Policy through Dynamics-Alignment [13.655111993491674]
Recent approaches leveraging video prediction models have shown promising results by learning rich representations from large-scale datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization.
arXiv Detail & Related papers (2025-10-31T02:29:33Z) - DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving [52.63591791507895]
We propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. Experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines.
arXiv Detail & Related papers (2025-10-14T17:59:47Z) - VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation [76.13140980997508]
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). We propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement).
arXiv Detail & Related papers (2025-10-10T17:59:56Z) - Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments [8.020686883632594]
Progressive Weight Loading (PWL) is a technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model (a minimal layer-swapping sketch appears after this list). Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded, matching the final accuracy of the full teacher model.
arXiv Detail & Related papers (2025-09-26T13:19:32Z) - EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z) - Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models [4.737806982257592]
This study proposes a knowledge distillation algorithm based on large language models and feature alignment. The proposed model performs very close to the state-of-the-art GPT-4 model on evaluation metrics such as perplexity, BLEU, ROUGE, and CER.
arXiv Detail & Related papers (2024-12-27T04:37:06Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
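The Progressive Weight Loading entry above describes incrementally swapping a deployed student's layers for the teacher's layers as they finish loading. Below is a minimal sketch of that swap under stated assumptions: it presumes the student and teacher expose layer-compatible `nn.ModuleList` stacks of equal depth, and the function name and deep-copy strategy are illustrative choices, not the paper's implementation.

```python
import copy
import torch.nn as nn


def swap_in_teacher_layers(student_layers: nn.ModuleList,
                           teacher_layers: nn.ModuleList,
                           loaded_count: int) -> nn.ModuleList:
    """Build a hybrid stack whose first `loaded_count` layers come from the
    (partially downloaded) teacher, while the rest remain student layers."""
    assert len(student_layers) == len(teacher_layers), "sketch assumes equal depth"
    hybrid = nn.ModuleList()
    for i in range(len(student_layers)):
        source = teacher_layers[i] if i < loaded_count else student_layers[i]
        hybrid.append(copy.deepcopy(source))
    return hybrid
```

As more teacher weights become available, `loaded_count` grows and the hybrid model's accuracy would be expected to approach the full teacher's, matching the behaviour the abstract reports.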
This list is automatically generated from the titles and abstracts of the papers on this site.