ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
- URL: http://arxiv.org/abs/2511.18082v1
- Date: Sat, 22 Nov 2025 14:44:03 GMT
- Title: ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
- Authors: Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang,
- Abstract summary: We present ActDistill, a framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. We employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup.
- Score: 14.202025149504715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.
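The abstract describes the student as executing only dynamically routed layers under action-guided, graph-informed supervision, but no implementation details are given here. The sketch below is a minimal, assumption-laden illustration of that routing-plus-distillation idea: the per-layer sigmoid gates, the pooled routing features, the MSE action/hidden losses, and the sparsity weight are hypothetical stand-ins, not ActDistill's actual graph-structured encapsulation or routing design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedStudent(nn.Module):
    """Student stack whose transformer layers are gated by lightweight per-layer routers."""

    def __init__(self, layers: nn.ModuleList, hidden_dim: int):
        super().__init__()
        self.layers = layers
        # One scalar gate per layer, computed from the pooled hidden state.
        self.routers = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in layers)

    def forward(self, h: torch.Tensor, hard: bool = False):
        # h: (batch, seq_len, hidden_dim)
        gate_scores = []
        for layer, router in zip(self.layers, self.routers):
            g = torch.sigmoid(router(h.mean(dim=1)))  # (batch, 1) gate per layer
            if hard:
                # Inference: skip the layer outright when the gate is closed.
                if g.mean().item() < 0.5:
                    gate_scores.append(g)
                    continue
                h = layer(h)
            else:
                # Training: soft interpolation keeps the router differentiable.
                h = g.unsqueeze(-1) * layer(h) + (1.0 - g.unsqueeze(-1)) * h
            gate_scores.append(g)
        return h, torch.cat(gate_scores, dim=-1)  # hidden states and (batch, n_layers) gates


def distillation_loss(student_hidden, student_gates, student_actions,
                      teacher_hidden, teacher_actions, sparsity_weight=0.01):
    """Action-guided distillation sketch: match the teacher's actions and hidden
    states while penalising the number of executed layers."""
    action_term = F.mse_loss(student_actions, teacher_actions)
    hidden_term = F.mse_loss(student_hidden, teacher_hidden)
    sparsity_term = student_gates.mean()  # pressure toward closed gates, i.e. fewer layers
    return action_term + hidden_term + sparsity_weight * sparsity_term
```

During training the soft gates keep gradients flowing to the router; at inference (`hard=True`) closed gates skip layers entirely, which is where the claimed computation and latency savings would come from.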
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - ActionCodec: What Makes for Good Action Tokenizers [106.78093973045526]
Vision-Language-Action (VLA) models have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity. We introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance.
arXiv Detail & Related papers (2026-02-17T07:07:15Z) - Learning Generalizable Visuomotor Policy through Dynamics-Alignment [13.655111993491674]
Recent approaches leveraging video prediction models have shown promising results by learning rich representations from large-scale datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization.
arXiv Detail & Related papers (2025-10-31T02:29:33Z) - DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving [52.63591791507895]
We propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. Experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines.
arXiv Detail & Related papers (2025-10-14T17:59:47Z) - VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation [76.13140980997508]
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). We propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement).
arXiv Detail & Related papers (2025-10-10T17:59:56Z) - Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments [8.020686883632594]
Progressive Weight Loading (PWL) is a technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model (a minimal layer-swapping sketch appears after this list). Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded, matching the final accuracy of the full teacher model.
arXiv Detail & Related papers (2025-09-26T13:19:32Z) - EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z) - Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models [4.737806982257592]
This study proposes a knowledge distillation algorithm based on large language models and feature alignment. The proposed model performs very close to the state-of-the-art GPT-4 model on evaluation metrics such as perplexity, BLEU, ROUGE, and CER.
arXiv Detail & Related papers (2024-12-27T04:37:06Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
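The Progressive Weight Loading entry above describes incrementally swapping a deployed student's layers for the teacher's layers as they finish loading. Below is a minimal sketch of that swap under stated assumptions: it presumes the student and teacher expose layer-compatible `nn.ModuleList` stacks of equal depth, and the function name and deep-copy strategy are illustrative choices, not the paper's implementation.

```python
import copy
import torch.nn as nn


def swap_in_teacher_layers(student_layers: nn.ModuleList,
                           teacher_layers: nn.ModuleList,
                           loaded_count: int) -> nn.ModuleList:
    """Build a hybrid stack whose first `loaded_count` layers come from the
    (partially downloaded) teacher, while the rest remain student layers."""
    assert len(student_layers) == len(teacher_layers), "sketch assumes equal depth"
    hybrid = nn.ModuleList()
    for i in range(len(student_layers)):
        source = teacher_layers[i] if i < loaded_count else student_layers[i]
        hybrid.append(copy.deepcopy(source))
    return hybrid
```

As more teacher weights become available, `loaded_count` grows and the hybrid model's accuracy would be expected to approach the full teacher's, matching the behaviour the abstract reports.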
This list is automatically generated from the titles and abstracts of the papers on this site.