RoboOmni: Proactive Robot Manipulation in Omni-modal Context
- URL: http://arxiv.org/abs/2510.23763v3
- Date: Sat, 01 Nov 2025 16:33:51 GMT
- Title: RoboOmni: Proactive Robot Manipulation in Omni-modal Context
- Authors: Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yu-Gang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu
- Abstract summary: We introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. We present RoboOmni, a framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
- Score: 165.09049429566238
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision-Language-Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.
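The Perceiver-Thinker-Talker-Executor loop described in the abstract can be sketched at a high level. Every class, method, and rule below is a hypothetical illustration of the described stages, not code from the actual RoboOmni system; in practice the stages would be realized by an end-to-end omni-modal LLM rather than hand-written rules.

```python
# Hypothetical sketch of a Perceiver-Thinker-Talker-Executor loop for
# proactive intention recognition; all names and logic are illustrative.
from dataclasses import dataclass


@dataclass
class Context:
    speech: str          # in practice, raw audio rather than a transcript
    ambient_sound: str   # e.g. "kettle whistling"
    visual_scene: str    # e.g. "mug on counter, user reading"


def perceive(ctx: Context) -> str:
    # Perceiver: fuse auditory and visual cues into one context description.
    return f"{ctx.speech} | {ctx.ambient_sound} | {ctx.visual_scene}"


def think(fused: str) -> str:
    # Thinker: infer a latent intention from the fused context
    # (stubbed here as a single rule for illustration).
    if "kettle whistling" in fused:
        return "user likely wants tea served"
    return "no proactive action inferred"


def talk(intention: str) -> str:
    # Talker: confirm the inferred intention with the user before acting.
    return f"Confirm: {intention}?"


def execute(intention: str, confirmed: bool) -> str:
    # Executor: act only once the user confirms.
    return f"executing: {intention}" if confirmed else "standing by"


ctx = Context("'I'm a bit tired'", "kettle whistling", "mug on counter")
intent = think(perceive(ctx))
print(talk(intent))
print(execute(intent, confirmed=True))
```

The point of the sketch is the ordering: intention recognition and user confirmation precede action execution, which is what distinguishes this setting from directly executing explicit commands.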
Related papers
- Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM [37.09992708719033]
This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps.
arXiv Detail & Related papers (2025-11-21T15:55:25Z)
- TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models [1.534667887016089]
The presented paper investigates recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs). This integration allows robots to understand and execute commands given in natural language and to perceive their environment through visual and/or descriptive inputs. Our paper outlines four LLM-assisted simulated robotic control experiments, which explore (i) low-level control, (ii) the generation of language-based feedback that describes the robot's internal states, (iii) the use of visual information as additional input, and (iv) the use of robot structure information for generating task plans and feedback.
arXiv Detail & Related papers (2024-12-19T23:43:40Z)
- In-Context Learning Enables Robot Action Prediction in LLMs [52.285739178561705]
We introduce RoboPrompt, a framework that enables off-the-shelf text-only Large Language Models to directly predict robot actions. RoboPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings.
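The in-context-learning idea behind RoboPrompt can be illustrated with a minimal sketch: serialize demonstration state-action pairs into a text prompt so a text-only LLM can complete it with the next action. The serialization format and function names below are invented for illustration, not taken from the paper.

```python
# Hypothetical illustration of in-context learning for robot action
# prediction: demonstrations become textual examples, and the query
# state is left for a text-only LLM to complete with an action.
def format_episode(states, actions):
    # Serialize one demonstration episode as "state -> action" lines.
    lines = [f"state: {s} -> action: {a}" for s, a in zip(states, actions)]
    return "\n".join(lines)


def build_prompt(demos, query_state):
    # Concatenate demonstration episodes, then append the query state
    # with a trailing "action:" slot for the model to fill in.
    blocks = [format_episode(states, actions) for states, actions in demos]
    return "\n\n".join(blocks) + f"\n\nstate: {query_state} -> action:"


demos = [
    (["cube at (0,1)"], ["pick (0,1)"]),
    (["cube at (2,3)"], ["pick (2,3)"]),
]
print(build_prompt(demos, "cube at (5,5)"))
```

The resulting string would be sent to the LLM as-is; no fine-tuning is involved, which is what makes the approach work with off-the-shelf models.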
arXiv Detail & Related papers (2024-10-16T17:56:49Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
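A conversation-style instruction-tuning record derived from a behavior-cloning episode might look roughly like the sketch below. The record structure, field names, and action syntax are hypothetical, intended only to show the recasting of (observation, instruction, action) triples into a dialogue format.

```python
# Hypothetical sketch of recasting a behavior-cloning episode as a
# visuo-textual conversation for instruction tuning; the format is
# illustrative, not LLaRA's actual data schema.
def episode_to_conversation(instruction, image_ref, action):
    # User turn carries the visual observation plus the instruction;
    # assistant turn carries the target action for supervision.
    return [
        {"role": "user", "content": f"<image:{image_ref}> {instruction}"},
        {"role": "assistant", "content": f"action: {action}"},
    ]


conv = episode_to_conversation(
    "pick up the red block",
    "frame_0013.png",
    "move_to(0.42, 0.17); close_gripper()",
)
for turn in conv:
    print(turn["role"], "->", turn["content"])
```

Because the records are ordinary chat turns, a VLM can be fine-tuned on them with a standard conversational training pipeline.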
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models [23.945922720555146]
We propose a system to achieve incremental learning of complex behavior from natural interaction.
We integrate the system in the robot cognitive architecture of the humanoid robot ARMAR-6.
arXiv Detail & Related papers (2023-09-08T13:29:05Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) and existing visual grounding and robotic grasping system.
We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.