MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines
- URL: http://arxiv.org/abs/2507.07818v2
- Date: Wed, 13 Aug 2025 13:45:40 GMT
- Title: MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines
- Authors: Lu Xu, Jiaqian Yu, Xiongfeng Peng, Yiwei Chen, Weiming Li, Jaewook Yoo, Sunghyun Chunag, Dongwook Lee, Daehyun Ji, Chao Zhang,
- Abstract summary: We introduce a novel Mixture-of-Expert (MoE) method that significantly boosts reasoning and learning efficiency for embodied AI.<n>General MoE models demand extensive training data and complex optimization, which limits their applicability in embodied AI.<n>We propose a skill-oriented MoE called MoSE, which mimics the human learning and reasoning process skill-by-skill, step-by-step.
- Score: 14.042949333988785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To meet the growing demand for smarter, faster, and more efficient embodied AI solutions, we introduce a novel Mixture-of-Expert (MoE) method that significantly boosts reasoning and learning efficiency for embodied autonomous systems. General MoE models demand extensive training data and complex optimization, which limits their applicability in embodied AI such as autonomous driving (AD) and robotic manipulation. In this work, we propose a skill-oriented MoE called MoSE, which mimics the human learning and reasoning process skill-by-skill, step-by-step. We introduce a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To better align with multi-step planning in human reasoning and in end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step-by-step. Unlike other multi-round dialogues, MoSE integrates valuable auxiliary tasks (e.g. perception-prediction-planning for AD, and high-level and low-level planning for robots) in one single forward process without introducing any extra computational cost. With less than 3B sparsely activated parameters, our model effectively grows more diverse expertise and outperforms models on both AD corner-case reasoning tasks and robot reasoning tasks with less than 40% of the parameters.
Related papers
- Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning [49.82882141491629]
We argue that effective online learning should scale the emphnumber of tasks, rather than the number of samples per task.<n>This regime reveals a structural advantage of model-based reinforcement learning.<n>We instantiate this idea with textbfEfficientZero-Multitask (EZ-M), a sample-efficient multi-task algorithm for online learning.
arXiv Detail & Related papers (2026-03-02T05:07:43Z) - Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning [56.129822832095726]
AdaMoE is a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models.<n>A substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
arXiv Detail & Related papers (2025-10-16T04:52:57Z) - SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents [93.26456498576181]
This paper focuses on the development of native Autonomous Single-Agent models for Deep Research.<n>Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark.
arXiv Detail & Related papers (2025-09-08T02:07:09Z) - Multi-modal Knowledge Distillation-based Human Trajectory Forecasting [35.060041571520024]
Pedestrian trajectory forecasting is crucial in various applications such as autonomous driving and mobile robot navigation.<n>In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy.<n>We propose a multi-modal knowledge distillation framework: a student model with limited modality is distilled from a teacher model trained with full range of modalities.
arXiv Detail & Related papers (2025-03-28T07:32:51Z) - MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models [34.138699712315]
This paper introduces a novel vision--action (VLA) model, mixture of robotic experts (MoRE) for quadruped robots.<n>MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model.<n>Experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios.
arXiv Detail & Related papers (2025-03-11T03:13:45Z) - LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - Latent-Predictive Empowerment: Measuring Empowerment without a Simulator [56.53777237504011]
We present Latent-Predictive Empowerment (LPE), an algorithm that can compute empowerment in a more practical manner.
LPE learns large skillsets by maximizing an objective that is a principled replacement for the mutual information between skills and states.
arXiv Detail & Related papers (2024-10-15T00:41:18Z) - MoExtend: Tuning New Experts for Modality and Task Extension [61.29100693866109]
MoExtend is an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models.
MoExtend seamlessly integrates new experts into pre-trained MoE models, endowing them with novel knowledge without the need to tune pretrained models.
arXiv Detail & Related papers (2024-08-07T02:28:37Z) - Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
arXiv Detail & Related papers (2024-03-18T08:00:23Z) - Practice Makes Perfect: Planning to Learn Skill Parameter Policies [34.51008914846429]
In this work, we focus on the active learning problem of choosing which skills to practice to maximize expected future task success.
We propose that the robot should estimate the competence of each skill, extrapolate the competence, and situate the skill in the task distribution through competence-aware planning.
This approach is implemented within a fully autonomous system where the robot repeatedly plans, practices, and learns without any environment resets.
arXiv Detail & Related papers (2024-02-22T23:58:26Z) - Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - A Central Motor System Inspired Pre-training Reinforcement Learning for Robotic Control [7.227887302864789]
We propose CMS-PRL, a pre-training reinforcement learning method inspired by the Central Motor System.
First, we introduce a fusion reward mechanism that combines the basic motor reward with mutual information reward.
Second, we design a skill encoding method inspired by the motor program of the basal ganglia, providing rich and continuous skill instructions.
Third, we propose a skill activity function to regulate motor skill activity, enabling the generation of skills with different activity levels.
arXiv Detail & Related papers (2023-11-14T00:49:12Z) - RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation [68.70755196744533]
RoboGen is a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation.
Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer them to the field of robotics.
arXiv Detail & Related papers (2023-11-02T17:59:21Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Skill-based Model-based Reinforcement Learning [18.758245582997656]
Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors.
We propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space.
We harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space.
arXiv Detail & Related papers (2022-07-15T16:06:33Z) - Skill-based Multi-objective Reinforcement Learning of Industrial Robot
Tasks with Planning and Knowledge Integration [0.4949816699298335]
We introduce an approach that provides a combination of task-level planning with targeted learning of scenario-specific parameters for skill-based systems.
We demonstrate the efficacy and versatility of our approach by learning skill parameters for two different contact-rich tasks.
arXiv Detail & Related papers (2022-03-18T16:03:27Z) - Robot Skill Adaptation via Soft Actor-Critic Gaussian Mixture Models [29.34375999491465]
A core challenge for an autonomous agent acting in the real world is to adapt its repertoire of skills to cope with its noisy perception and dynamics.
To scale learning of skills to long-horizon tasks, robots should be able to learn and later refine their skills in a structured manner.
We proposeSAC-GMM, a novel hybrid approach that learns robot skills through a dynamical system and adapts the learned skills in their own trajectory distribution space.
arXiv Detail & Related papers (2021-11-25T15:36:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.