Related papers: Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation

URL: http://arxiv.org/abs/2510.24055v1
Date: Tue, 28 Oct 2025 04:27:03 GMT
Title: Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
Authors: Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai,
Abstract summary: We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
Score: 1.731102560795011
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Perceptual ambiguity and task conflict limit multi-task robotic manipulation via imitation learning. We propose a framework combining a Language-Conditioned Visual Representation (LCVR) module and a Language-conditioned Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual ambiguities by grounding visual features with language instructions, enabling differentiation between visually similar tasks. To mitigate task conflict, LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal action distributions, stabilized by gradient modulation. On real-robot benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion Policy (DP) success rates by 33.75% and 25%, respectively. The full framework achieves a 79% average success rate, outperforming the advanced baseline by 21%. Our work shows that combining semantic grounding and expert specialization enables robust, efficient multi-task manipulation.
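To make the "sparse expert architecture" in the abstract concrete, here is a minimal sketch of language-conditioned top-k expert routing. It is an illustration under stated assumptions, not the authors' LMoE-DP: the function names, the linear gate, the toy scaling experts, and the deterministic weighted blend of expert outputs are all hypothetical, and the gradient modulation and multimodal action distributions mentioned in the abstract are not modeled.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(lang_embedding, gate_weights, k):
    """Pick the k experts with the highest gate logits.

    gate_weights is a hypothetical linear gate: one weight row per expert.
    Returns {expert_index: weight}, with weights renormalized over the top k.
    """
    logits = [
        sum(w * x for w, x in zip(row, lang_embedding)) for row in gate_weights
    ]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def moe_action(lang_embedding, visual_features, gate_weights, experts, k=2):
    """Blend the outputs of only the selected experts (sparse evaluation)."""
    gates = top_k_gate(lang_embedding, gate_weights, k)
    # Only the chosen experts are run; the rest are skipped entirely.
    outputs = {i: experts[i](visual_features) for i in gates}
    dim = len(next(iter(outputs.values())))
    action = [sum(gates[i] * outputs[i][d] for i in gates) for d in range(dim)]
    return action, gates


def make_expert(scale):
    """Toy expert (assumption): scales the visual features into an 'action'."""
    return lambda features: [scale * v for v in features]


# Four toy experts and a 2-D instruction embedding, all made up for the demo.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
action, gates = moe_action([1.0, 0.5], [1.0, 2.0], gate_weights, experts, k=2)
print("selected experts and weights:", gates)
print("blended action:", action)
```

Routing on the instruction embedding alone means different instructions can activate different experts even when the visual input looks similar, which is one plausible reading of how expert specialization reduces task conflict.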

Related papers

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation [83.75249714794977]
We present Crab$^{+}$, a scalable and unified audio-visual scene understanding model. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset. On the model side, we design a unified interface to align heterogeneous task formulations. We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines in nearly 88% of tasks.
arXiv Detail & Related papers (2026-03-04T14:43:57Z)
Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining [59.2578488860426]
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors. Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. We propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning.
arXiv Detail & Related papers (2026-03-02T11:38:12Z)
MM-ACT: Learn from Multimodal Parallel Generation to Act [80.9182259389658]
MM-ACT integrates text, image, and action in a shared token space and performs generation across all three modalities. Context-Shared Multimodal Learning supervises generation in all three modalities from a shared context. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0.
arXiv Detail & Related papers (2025-11-30T16:46:35Z)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
Objective Soups: Multilingual Multi-Task Modeling for Speech Processing [69.52720282028385]
Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks. This paper investigates three multi-objective MSP formulations, which we refer to as objective soup recipes. Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models.
arXiv Detail & Related papers (2025-08-12T07:01:09Z)
Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control [22.74768543283102]
Graph-Fused Vision-Language-Action (GF-VLA) is a framework that enables dual-arm robotic systems to perform task-level reasoning and execution. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance. A cross-hand selection policy infers optimal assignment without explicit geometric reasoning.
arXiv Detail & Related papers (2025-08-07T12:48:09Z)
MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories [4.503215272392276]
We propose MORAL (a multimodal reinforcement learning framework for decision making in autonomous laboratories). We generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy. Experimental results demonstrate that multimodal agents achieve a 20% improvement in task completion rates.
arXiv Detail & Related papers (2025-04-04T04:15:52Z)
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [21.844214660424175]
ChatVLA is a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
arXiv Detail & Related papers (2025-02-20T10:16:18Z)
GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment. We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively. GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z)
Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation [14.354318744503088]
We present Sigma-Agent, an end-to-end imitation learning agent for multi-task robotic manipulation. Sigma-Agent incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. Sigma-Agent shows substantial improvement over state-of-the-art methods under diverse settings.
arXiv Detail & Related papers (2024-06-14T05:53:00Z)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
Multi-Level Compositional Reasoning for Interactive Instruction Following [24.581542880280203]
The Multi-level Compositional Reasoning Agent (MCR-Agent) operates at three levels. At the highest level, a high-level policy composition controller infers a sequence of human-interpretable subgoals to be executed based on language instructions. At the middle level, a master policy discriminatively controls the agent's navigation by alternating between a navigation policy and various independent interaction policies. At the lowest level, the appropriate interaction policy infers manipulation actions with the corresponding object masks.
arXiv Detail & Related papers (2023-08-18T08:38:28Z)
