Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
- URL: http://arxiv.org/abs/2510.14300v1
- Date: Thu, 16 Oct 2025 04:52:57 GMT
- Title: Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
- Authors: Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu
- Abstract summary: AdaMoE is a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models. A substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
- Score: 56.129822832095726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing its feedforward layers with sparsely activated MoE layers. AdaMoE decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize: through collaborative expert utilization, we achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
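The mechanism described in the abstract is easy to state in code: the pretrained dense feedforward block of the action expert is replaced by several expert FFNs, a router produces selection scores, and a separate scale adapter produces the contribution weights. The sketch below is a minimal illustration of that decoupling; the class name AdaMoELayer, the sigmoid-gated adapter, the top-k value, and the simple dense dispatch loop are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class AdaMoELayer(nn.Module):
    """Sketch: sparsely activated MoE feedforward layer whose expert
    selection (router) is decoupled from expert weighting (scale adapter)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary FFN; in AdaMoE these would be initialized
        # from the pretrained dense action expert's feedforward weights.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)         # which experts to select
        self.scale_adapter = nn.Linear(d_model, num_experts)  # how much each contributes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        router_logits = self.router(x)                 # selection scores
        scales = torch.sigmoid(self.scale_adapter(x))  # independent contribution weights
        _, top_idx = router_logits.topk(self.top_k, dim=-1)

        # Dense dispatch loop kept simple for clarity; an efficient
        # implementation would gather the tokens routed to each expert.
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx = top_idx[..., k]                            # (batch, seq) selected expert ids
            gate = scales.gather(-1, idx.unsqueeze(-1))      # decoupled weight, (batch, seq, 1)
            for e, expert in enumerate(self.experts):
                token_mask = (idx == e).unsqueeze(-1).float()  # tokens routed to expert e
                out = out + token_mask * gate * expert(x)
        return out
```

The decoupling is visible in the forward pass: top_idx (from the router) decides which experts run, while gate (from the scale adapter) decides how strongly each selected expert contributes, so selection probabilities are not forced to double as mixing weights.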
Related papers
- Sample-Efficient Robot Skill Learning for Construction Tasks: Benchmarking Hierarchical Reinforcement Learning and Vision-Language-Action VLA Model [9.025728945376468]
This study evaluates two leading approaches for teaching construction robots new skills. The goal is to understand both task performance and the practical effort needed to deploy each approach on real jobs.
arXiv Detail & Related papers (2025-12-16T02:56:13Z) - Towards Accessible Physical AI: LoRA-Based Fine-Tuning of VLA Models for Real-World Robot Control [0.0]
This paper presents an efficient fine-tuning methodology and real-world deployment analysis for adapting VLA models to low-cost robotic manipulation systems. We propose a resource-efficient fine-tuning strategy using Low-Rank Adaptation (LoRA) and quantization techniques. Our methodology addresses the critical challenge of adapting pre-trained VLA models to new robot embodiments with limited demonstration data.
arXiv Detail & Related papers (2025-12-11T16:25:30Z) - VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation [76.13140980997508]
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). We propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (17% improvement).
arXiv Detail & Related papers (2025-10-10T17:59:56Z) - MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models [52.876185634349575]
We propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into Large Vision-Language Models (LVLMs). For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts. Our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models.
arXiv Detail & Related papers (2025-08-13T13:00:05Z) - FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation [11.080979029271019]
Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. We propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance.
arXiv Detail & Related papers (2025-08-04T08:39:43Z) - MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines [14.042949333988785]
We introduce a novel Mixture-of-Experts (MoE) method that significantly boosts reasoning and learning efficiency for embodied AI. General MoE models demand extensive training data and complex optimization, which limits their applicability in embodied AI. We propose a skill-oriented MoE called MoSE, which mimics the human learning and reasoning process skill-by-skill, step-by-step.
arXiv Detail & Related papers (2025-07-10T14:48:08Z) - Is Diversity All You Need for Scalable Robotic Manipulation? [50.747150672933316]
We investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". We show that task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios. We propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data.
arXiv Detail & Related papers (2025-07-08T17:52:44Z) - MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models [34.138699712315]
This paper introduces a novel vision-language-action (VLA) model, mixture of robotic experts (MoRE), for quadruped robots. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model. Experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios.
arXiv Detail & Related papers (2025-03-11T03:13:45Z) - Improving Vision-Language-Action Model with Online Reinforcement Learning [17.043068379668842]
Recent studies have successfully integrated large vision-language models into low-level robotic control by supervised fine-tuning. We propose the iRe-VLA framework, which iterates between reinforcement learning and supervised learning to effectively improve VLA models.
arXiv Detail & Related papers (2025-01-28T02:53:48Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z) - Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input; a minimal sketch of this idea appears below.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
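As a point of contrast with AdaMoE's decoupled top-k routing, the dynamic-routing entry above activates a variable number of experts per input based on router confidence. Below is a minimal, hedged sketch of one common way to implement that idea; the function name, the threshold parameter p, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def confidence_based_expert_mask(router_logits: torch.Tensor, p: float = 0.5):
    """Per token, keep the smallest set of experts whose cumulative routing
    probability reaches the confidence threshold p.

    router_logits: (num_tokens, num_experts)
    Returns: boolean mask (num_tokens, num_experts) of activated experts,
             plus the full routing probabilities.
    """
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    # Cumulative probability accumulated *before* each sorted expert; an expert
    # is kept while that prior mass is still below p, so at least one expert is
    # always selected and flatter (less confident) routings select more experts.
    prior_mass = sorted_probs.cumsum(dim=-1) - sorted_probs
    keep_sorted = (prior_mass < p).float()
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted).bool()
    return mask, probs
```

Intuitively, a confident router concentrates probability mass on one expert and activates few; an uncertain router spreads mass and activates more, so harder inputs receive more compute.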