FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
- URL: http://arxiv.org/abs/2508.02190v1
- Date: Mon, 04 Aug 2025 08:39:43 GMT
- Title: FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
- Authors: Cui Miao, Tao Chang, Meihan Wu, Hongbin Xu, Chun Li, Ming Li, Xiaodong Wang
- Abstract summary: Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. We propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance.
- Score: 11.080979029271019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer. Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.
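The abstract describes two mechanisms that can be sketched concretely: a dual-gating MoE layer (a token-side router picks top-k experts, and each expert's own "self-aware" gate can decline activation) and a server-side aggregation weighted by expert activations. The sketch below is illustrative only, under assumed forms not taken from the paper: a softmax top-k router, a sigmoid expert-side gate with a fixed threshold, and per-expert activation counts as aggregation weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DualGateMoE:
    """Illustrative dual-gating MoE layer: a token-side router selects
    top-k experts per token, and each expert's own scalar gate may
    decline activation, sparsifying computation further."""

    def __init__(self, dim, n_experts, k=2, thresh=0.1):
        self.router = rng.normal(0, 0.02, (dim, n_experts))  # token-side gate
        self.experts = [rng.normal(0, 0.02, (dim, dim)) for _ in range(n_experts)]
        self.self_gate = np.zeros(n_experts)  # expert-side "self-aware" logits
        self.k, self.thresh = k, thresh
        self.usage = np.zeros(n_experts)      # activation counts, used by the server

    def forward(self, x):                     # x: (tokens, dim)
        scores = softmax(x @ self.router)     # (tokens, n_experts)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            topk = np.argsort(scores[t])[-self.k:]
            for e in topk:
                # expert-side gate: sigmoid of the expert's own logit
                g = 1.0 / (1.0 + np.exp(-self.self_gate[e]))
                if g * scores[t, e] < self.thresh:
                    continue                  # expert declines activation
                self.usage[e] += 1
                out[t] += scores[t, e] * (x[t] @ self.experts[e])
        return out

def expert_driven_aggregate(client_layers):
    """Server-side sketch: average each expert across clients, weighting
    a client's copy by how often that expert was activated locally."""
    n_experts = len(client_layers[0].experts)
    merged = []
    for e in range(n_experts):
        w = np.array([c.usage[e] for c in client_layers], dtype=float)
        w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
        merged.append(sum(wi * c.experts[e] for wi, c in zip(w, client_layers)))
    return merged
```

The expert-side gate is what distinguishes the "dual" gating from a vanilla top-k MoE: even a token-selected expert can skip computation when its combined gate value falls below the threshold, which is one plausible reading of how computational efficiency improves over the vanilla counterpart.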
Related papers
- ActionCodec: What Makes for Good Action Tokenizers [106.78093973045526]
Vision-Language-Action (VLA) models have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity. We introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance.
arXiv Detail & Related papers (2026-02-17T07:07:15Z) - iFlyBot-VLA Technical Report [25.330744626382977]
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; and (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets.
arXiv Detail & Related papers (2025-11-01T06:24:56Z) - Learning To Defer To A Population With Limited Demonstrations [13.40222956306532]
This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations.
arXiv Detail & Related papers (2025-10-22T08:18:02Z) - Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning [56.129822832095726]
AdaMoE is a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models. A substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
arXiv Detail & Related papers (2025-10-16T04:52:57Z) - FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks [45.65159253753118]
This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations. We provide language feedback embeddings as part of the input sequence into a Transformer-based policy. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment.
arXiv Detail & Related papers (2025-10-13T11:55:21Z) - InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation [43.83789393525928]
InstructVLA is an end-to-end vision-language model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation. On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA.
arXiv Detail & Related papers (2025-07-23T13:57:06Z) - UniVLA: Learning to Act Anywhere with Task-centric Latent Actions [32.83715417294052]
UniVLA is a new framework for learning cross-embodiment vision-language-action (VLA) policies. We derive task-centric action representations from videos with a latent action model. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments.
arXiv Detail & Related papers (2025-05-09T15:11:13Z) - Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [92.99416966226724]
We introduce Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms. We apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance.
arXiv Detail & Related papers (2024-11-05T23:26:10Z) - Federated Learning for Face Recognition via Intra-subject Self-supervised Learning [3.9899461012388504]
We propose FedFS (Federated Learning for personalized Face recognition via intra-subject Self-supervised learning framework) to train personalized face recognition models without imposing subjects.
FedFS comprises two crucial components that leverage aggregated features of the local and global models to cooperate with representations of an off-the-shelf model.
We conduct comprehensive experiments on the DigiFace-1M and VGGFace datasets, demonstrating superior performance compared to previous methods.
arXiv Detail & Related papers (2024-07-23T08:43:42Z) - Task-agnostic Decision Transformer for Multi-type Agent Control with Federated Split Training [34.80971707794908]
Federated Split Decision Transformer (FSDT) is an innovative framework designed explicitly for AI agent decision tasks.
FSDT framework excels at navigating the intricacies of personalized agents by harnessing distributed data for training.
Our findings underscore the efficacy of the FSDT framework in effectively leveraging distributed offline reinforcement learning data to enable powerful multi-type agent decision systems.
arXiv Detail & Related papers (2024-05-22T08:37:37Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specified auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z) - Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z) - Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z) - Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting [66.45372974713189]
We propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks.
Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark.
We provide open-source RecAdam, which integrates the proposed mechanisms into Adam to facilitate adoption by the NLP community.
arXiv Detail & Related papers (2020-04-27T08:59:57Z)
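The recall-and-learn entry above describes jointly learning the downstream task while recalling pretraining knowledge. A minimal sketch of that idea is shown below; it is not the released RecAdam implementation, and the sigmoid annealing schedule, its constants, and the quadratic recall penalty form are assumptions for illustration.

```python
import numpy as np

def recall_and_learn_step(theta, theta_pre, grad_task, t, lr=1e-3,
                          gamma=1.0, k=0.5, t0=100):
    """One illustrative update that blends the downstream-task gradient
    with a quadratic 'recall' pull toward the pretrained weights theta_pre.
    lam(t) anneals from ~0 (mostly recall) to ~1 (mostly task learning)."""
    lam = 1.0 / (1.0 + np.exp(-k * (t - t0)))  # assumed annealing schedule
    grad = lam * grad_task + (1.0 - lam) * gamma * (theta - theta_pre)
    return theta - lr * grad
```

Early in training the update mostly pulls the weights back toward `theta_pre`, and as `t` grows the task gradient dominates, which is the multi-task trade-off the summary refers to.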
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.