Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
- URL: http://arxiv.org/abs/2510.24055v1
- Date: Tue, 28 Oct 2025 04:27:03 GMT
- Title: Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
- Authors: Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai,
- Abstract summary: We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP)<n>On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively.<n>Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
- Score: 1.731102560795011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perceptual ambiguity and task conflict limit multitask robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-ofExperts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation
Related papers
- Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation [83.75249714794977]
We present Crab$+$, a scalable and unified audio-visual scene understanding model.<n>On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset.<n>On the model side, we design a unified interface to align heterogeneous task formulations.<n>We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines in nearly 88% of tasks.
arXiv Detail & Related papers (2026-03-04T14:43:57Z) - Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors.<n>Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning.<n>We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z) - MM-ACT: Learn from Multimodal Parallel Generation to Act [80.9182259389658]
MM-ACT integrates text, image, and action in shared token space and performs generation across all three modalities.<n> Context-Shared Multimodal Learning supervises generation in all three modalities from a shared context.<n>Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0.
arXiv Detail & Related papers (2025-11-30T16:46:35Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - Objective Soups: Multilingual Multi-Task Modeling for Speech Processing [69.52720282028385]
Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks.<n>This paper investigates three multi-objective MSP formulations, which we refer to as textbfobjective soup recipes.<n>Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models.
arXiv Detail & Related papers (2025-08-12T07:01:09Z) - Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control [22.74768543283102]
Graph-Fused Vision-Language-Action (GF-VLA) is a framework that enables dual-arm robotic systems to perform task-level reasoning and execution.<n>GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance.<n>Cross-hand selection policy infers optimal assignment without explicit geometric reasoning.
arXiv Detail & Related papers (2025-08-07T12:48:09Z) - MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories [4.503215272392276]
We propose MORAL (a multimodal reinforcement learning framework for decision making in autonomous laboratories)<n>We generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy.<n> Experimental results demonstrate that multimodal agents achieve a 20% improvement in task completion rates.
arXiv Detail & Related papers (2025-04-04T04:15:52Z) - ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [21.844214660424175]
ChatVLA is a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference.<n>ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks.<n>Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
arXiv Detail & Related papers (2025-02-20T10:16:18Z) - GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment.<n>We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively.<n> GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z) - Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation [14.354318744503088]
We present Sigma-Agent, an end-to-end imitation learning agent for multi-task robotic manipulation.
Sigma-Agent incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations.
Sigma-Agent shows substantial improvement over state-of-the-art methods under diverse settings.
arXiv Detail & Related papers (2024-06-14T05:53:00Z) - SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Multi-Level Compositional Reasoning for Interactive Instruction
Following [24.581542880280203]
Multi-level Compositional Reasoning Agent (MCR-Agent)
At the highest level, we infer a sequence of human-interpretable subgoals to be executed based on language instructions by a high-level policy composition controller.
At the middle level, we discriminatively control the agent's navigation by a master policy by alternating between a navigation policy and various independent interaction policies.
At the lowest level, we infer manipulation actions with the corresponding object masks using the appropriate interaction policy.
arXiv Detail & Related papers (2023-08-18T08:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.