VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
- URL: http://arxiv.org/abs/2510.21817v1
- Date: Tue, 21 Oct 2025 17:59:56 GMT
- Title: VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
- Authors: Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He,
- Abstract summary: Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm.<n>VITA-E is a novel embodied interaction framework designed for both behavioral and nearly real-time interruption.
- Score: 66.90028121194636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an ``Active Model'' and a ``Standby Model'', allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a ``model-as-controller'' paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
Related papers
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation [48.6868174403074]
We introduce U-Mind, the first unified system for high-intelligence multimodal dialogue.<n>It supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.<n>We show that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks.
arXiv Detail & Related papers (2026-02-27T07:07:02Z) - End-to-end Listen, Look, Speak and Act [22.047534228540783]
ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial intelligence.<n>At its core is a novel SA-MoE (Attention Mixture-of-Experts) that routes each modality to specialized experts fuses them through a unified attention backbone.
arXiv Detail & Related papers (2025-10-19T08:45:46Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving [14.486548540613791]
We introduce ViLaD, a novel Large Vision Language Diffusion framework for end-to-end autonomous driving.<n>ViLaD enables parallel generation of entire driving decision sequences, significantly reducing computational latency.<n>We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed.
arXiv Detail & Related papers (2025-08-18T04:01:56Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.<n>DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.<n>Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction [5.958765450103163]
We present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching.<n>The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real-time.<n>Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.
arXiv Detail & Related papers (2024-07-11T00:10:45Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z) - Persistent-Transient Duality: A Multi-mechanism Approach for Modeling
Human-Object Interaction [58.67761673662716]
Humans are highly adaptable, swiftly switching between different modes to handle different tasks, situations and contexts.
In Human-object interaction (HOI) activities, these modes can be attributed to two mechanisms: (1) the large-scale consistent plan for the whole activity and (2) the small-scale children interactive actions that start and end along the timeline.
This work proposes to model two concurrent mechanisms that jointly control human motion.
arXiv Detail & Related papers (2023-07-24T12:21:33Z) - VIRT: Improving Representation-based Models for Text Matching through
Virtual Interaction [50.986371459817256]
We propose a novel textitVirtual InteRacTion mechanism, termed as VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.