EmboMatrix: A Scalable Training-Ground for Embodied Decision-Making
- URL: http://arxiv.org/abs/2510.12072v1
- Date: Tue, 14 Oct 2025 02:26:52 GMT
- Title: EmboMatrix: A Scalable Training-Ground for Embodied Decision-Making
- Authors: Zixing Lei, Sheng Yin, Yichen Xiong, Yuanzhuo Ding, Wenhao Huang, Yuxi Wei, Qingyao Xu, Yiming Li, Weixin Li, Yunhong Wang, Siheng Chen
- Abstract summary: Embodied decision-making enables agents to translate high-level goals into executable actions through continuous interactions within the physical world. Large language models (LLMs), with their general decision-making capabilities, offer a promising path to realize this potential. We propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals.
- Score: 60.15832211188291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied decision-making enables agents to translate high-level goals into executable actions through continuous interactions within the physical world, forming a cornerstone of general-purpose embodied intelligence. Large language models (LLMs), with their general decision-making capabilities, offer a promising path to realize this potential; however, LLMs trained solely on language lack exposure to physical environments, limiting their true embodied understanding. To bridge this gap, we propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals, offering a one-stop solution for LLMs to acquire genuine embodied decision-making skills. In this work, we present EmboMatrix, the first training ground of its kind, providing massive and diverse tasks with efficient simulation and precise rewards. EmboMatrix incorporates a series of novel techniques: a multi-agent data engine for large-scale task and scene generation, a distributed heterogeneous-hardware system for scalable simulation, and a multi-level reward architecture for precise supervision. Leveraging EmboMatrix, we cultivate EmboBrain, an LLM whose embodied decision-making abilities emerge from extensive embodied interactions. Experiments show that EmboBrain-7B surpasses the 671B DeepSeek-R1 baseline by 9.5% on two challenging embodied decision-making benchmarks, demonstrating the power of interactive, environment-grounded learning for building truly intelligent embodied agents.
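To make the training-ground idea concrete, here is a minimal, hypothetical sketch of the episode loop such an infrastructure would support: generated tasks and scenes feed a simulator, an LLM policy chooses actions, and several reward functions are aggregated into a feedback signal. All names and interfaces below are illustrative assumptions, not the actual EmboMatrix API.

```python
# Hypothetical sketch of a training-ground episode loop: generated tasks/scenes,
# embodied simulation, and multi-level reward feedback for an LLM policy.
# All names (Task, simulator, policy, reward_fns) are illustrative assumptions,
# NOT the actual EmboMatrix API.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Task:
    instruction: str  # natural-language goal, e.g. "put the mug in the sink"
    scene_id: str     # identifier of the generated scene to load


def run_episode(task: Task,
                simulator: Any,                          # executes actions, returns observations
                policy: Callable[[str, str], str],       # LLM: (instruction, observation) -> action
                reward_fns: List[Callable[..., float]],  # step-, subgoal-, and task-level rewards
                max_steps: int = 50) -> List[Dict[str, Any]]:
    """Roll out one embodied episode and collect (observation, action, reward) feedback."""
    obs = simulator.reset(task.scene_id)
    trajectory = []
    for _ in range(max_steps):
        action = policy(task.instruction, obs)                     # high-level decision from the LLM
        obs, done = simulator.step(action)                         # grounded execution in simulation
        reward = sum(fn(task, obs, action) for fn in reward_fns)   # aggregate multi-level signal
        trajectory.append({"obs": obs, "action": action, "reward": reward})
        if done:
            break
    return trajectory
```

Trajectories collected this way, sampled across the generated task distribution, are the kind of environment-grounded feedback the abstract describes using to train EmboBrain.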
Related papers
- Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z)
- BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities [61.173773299032746]
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. We introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.
arXiv Detail & Related papers (2025-10-09T19:18:36Z)
- Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation [89.5123417007126]
We show how to make Large Multimodal Models (LMMs) understand the spatial action space. We also show how to fully exploit the reasoning capacity of LMMs in solving these tasks. Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages.
arXiv Detail & Related papers (2025-05-19T06:00:14Z)
- MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
A Gym-style framework for systematically training (via reinforcement learning), evaluating, and improving autonomous large language model (LLM) agents. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
- GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill [25.686589649523587]
Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space.
arXiv Detail & Related papers (2025-04-05T14:44:47Z)
- MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories [4.503215272392276]
We propose MORAL, a multimodal reinforcement learning framework for decision making in autonomous laboratories. We generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy. Experimental results demonstrate that multimodal agents achieve a 20% improvement in task completion rates.
arXiv Detail & Related papers (2025-04-04T04:15:52Z) - MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation [24.200547898713126]
Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data.<n>Their real-world deployment is hindered by substantial computational and storage demands.<n>We propose a Mixture-of-Layers Vision-Language-Action model (MoLe) architecture for dynamic LLM layer activation.
arXiv Detail & Related papers (2025-03-26T10:05:38Z) - Instruction-Augmented Long-Horizon Planning: Embedding Grounding Mechanisms in Embodied Mobile Manipulation [39.43049944895508]
We present the Instruction-Augmented Long-Horizon Planning (IALP) system, which generates feasible and optimal actions based on real-time sensor feedback.<n>Our results demonstrate that the IALP system can efficiently solve tasks with an average success rate exceeding 80%.
arXiv Detail & Related papers (2025-03-11T06:37:33Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2024-10-15T00:41:18Z)
- Latent-Predictive Empowerment: Measuring Empowerment without a Simulator [56.53777237504011]
We present Latent-Predictive Empowerment (LPE), an algorithm that can compute empowerment in a more practical manner.
LPE learns large skillsets by maximizing an objective that is a principled replacement for the mutual information between skills and states.
arXiv Detail & Related papers (2024-10-15T00:41:18Z)
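For context on the Latent-Predictive Empowerment entry above: the skill-state mutual information that LPE's objective is said to replace is commonly written as the empowerment-style quantity below. This is a textbook formulation assumed here as background, not quoted from the LPE abstract.

```latex
% Standard skill-state mutual-information (empowerment-style) objective;
% general background, not taken from the LPE abstract itself.
\[
  I(Z; S') \;=\; H(Z) - H(Z \mid S')
  \;=\; \mathbb{E}_{z \sim p(z),\; s' \sim p(s' \mid z)}
        \!\left[ \log \frac{p(z \mid s')}{p(z)} \right]
\]
% where $Z$ is the skill (latent option) and $S'$ is the state reached by executing it.
```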