Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots
- URL: http://arxiv.org/abs/2602.07434v1
- Date: Sat, 07 Feb 2026 08:32:54 GMT
- Title: Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots
- Authors: Songhua Yang, Xuetao Li, Xuanye Fei, Mengde Li, Miao Li
- Abstract summary: We present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions. We implement both cloud-based and edge-deployed versions (SeM$^2_e$), with the latter knowledge-distilled to operate efficiently on edge hardware. Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence.
- Score: 7.665995147018354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective human-robot interaction requires emotionally rich multimodal expressions, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge Speech, Emotion, and Motion, we present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module capturing user contextual cues, Chain-of-Thought reasoning for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both cloud-based and edge-deployed versions (SeM$^2_e$), with the latter knowledge-distilled to operate efficiently on edge hardware while maintaining 95% of the relative performance. Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence, advancing socially expressive humanoid robotics for diverse real-world environments.
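The abstract names three components (multimodal perception, Chain-of-Thought planning, and SSAM-based execution) but gives no implementation detail. The following minimal Python sketch shows one plausible way such a pipeline could be wired together; every name in it (`AlignedSegment`, `perceive`, `plan_response`, `execute`) is an illustrative assumption, not the authors' released code.

```python
# Hypothetical sketch of a SeM^2-style pipeline, assuming the three-stage
# structure described in the abstract. Not the authors' implementation.
from dataclasses import dataclass


@dataclass
class AlignedSegment:
    text: str        # verbal content for this span
    emotion: str     # emotion label driving face and gesture
    start_s: float   # onset within the spoken utterance, seconds
    end_s: float     # offset within the spoken utterance, seconds


def perceive(image, audio):
    """Multimodal perception: extract contextual cues from the user.

    A real system would run a VLM over the camera frame and speech/affect
    models over the audio; this stub returns placeholder cues.
    """
    return {"speech": "transcribed user utterance", "affect": "neutral"}


def plan_response(cues):
    """Chain-of-Thought response planning (stand-in for a VLM call)."""
    # The framework prompts a VLM to reason before answering; this stub
    # returns a fixed plan of emotion-tagged, time-stamped segments.
    return [
        AlignedSegment("Hello there!", "happy", 0.0, 0.8),
        AlignedSegment("How can I help?", "attentive", 0.8, 1.9),
    ]


def execute(segments):
    """SSAM-style execution against a shared timeline.

    The Semantic-Sequence Aligning Mechanism is described only at a high
    level; this loop simply dispatches each segment's speech, facial
    expression, and gesture with matching start/end times.
    """
    for seg in segments:
        print(f"[{seg.start_s:.1f}s-{seg.end_s:.1f}s] "
              f"say={seg.text!r} face/gesture={seg.emotion}")


if __name__ == "__main__":
    execute(plan_response(perceive(image=None, audio=None)))
```

The edge variant SeM$^2_e$ is said to be knowledge-distilled while retaining 95% of relative performance, but the recipe is not given. The snippet below shows standard soft-target distillation (KL divergence to a teacher plus cross-entropy to labels) as one plausible ingredient; the temperature `T` and mixing weight `alpha` are assumed hyperparameters.

```python
# Generic soft-target knowledge distillation loss; one common way to
# compress a cloud model into an edge model. Hyperparameters are assumed.
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```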
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z) - U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation [48.6868174403074]
We introduce U-Mind, the first unified system for high-intelligence multimodal dialogue. It supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. We show that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks.
arXiv Detail & Related papers (2026-02-27T07:07:02Z) - Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel Recurrent Geometric-prior Multimodal Policy with Spiking features. To ground high-level reasoning in physical reality, it leverages lightweight 2D geometric inductive biases. To address data efficiency in robotic action generation, it introduces a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - End-to-end Listen, Look, Speak and Act [22.047534228540783]
ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial intelligence. At its core is a novel SA-MoE (Attention Mixture-of-Experts) architecture that routes each modality to specialized experts and fuses them through a unified attention backbone; see the modality-routing sketch at the end of this list.
arXiv Detail & Related papers (2025-10-19T08:45:46Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex speech interaction between humans and computers enables real-time spoken dialogue systems. Modeling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM speech interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - AIVA: An AI-based Virtual Companion for Emotion-aware Interaction [10.811567597962453]
AIVA is an AI-based virtual companion that captures multimodal sentiment cues. It provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.
arXiv Detail & Related papers (2025-09-03T11:00:46Z) - OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation [29.41106195298283]
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. We propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive.
arXiv Detail & Related papers (2025-08-26T17:15:26Z) - Enhancing Explainability with Multimodal Context Representations for Smarter Robots [0.0]
A key issue in Human-Robot Interaction is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision. We propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities.
arXiv Detail & Related papers (2025-02-28T13:36:47Z) - Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [85.63710017456792]
FuSe is a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities. We show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound. Experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
arXiv Detail & Related papers (2025-01-08T18:57:33Z) - When Words Smile: Generating Diverse Emotional Facial Expressions from Text [77.1867389815291]
We introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent.
arXiv Detail & Related papers (2024-12-03T15:39:05Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RoboScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation on robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
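Several entries above hinge on routing-and-fusion designs; ELLSA's SA-MoE, for instance, is summarized as routing each modality to specialized experts and fusing them through a unified attention backbone. The PyTorch sketch below is an assumed, minimal rendering of that idea: hard routing by modality tag, one expert MLP per modality, and shared self-attention for fusion. Dimensions and module names are illustrative, not taken from the paper.

```python
# Minimal modality-routed mixture-of-experts, in the spirit of (but not
# copied from) ELLSA's SA-MoE. All sizes and names are assumptions.
import torch
import torch.nn as nn


class ModalityMoE(nn.Module):
    def __init__(self, d_model=256, modalities=("audio", "vision", "text")):
        super().__init__()
        # One specialized expert per modality: hard routing by modality tag.
        self.experts = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(d_model, d_model), nn.GELU(),
                nn.Linear(d_model, d_model),
            )
            for m in modalities
        })
        # A unified attention backbone fuses the routed tokens.
        self.fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, tokens: dict) -> torch.Tensor:
        # tokens maps modality name -> (batch, seq, d_model) tensor
        routed = [self.experts[m](x) for m, x in tokens.items()]
        seq = torch.cat(routed, dim=1)       # concatenate along the sequence
        fused, _ = self.fuse(seq, seq, seq)  # shared self-attention fusion
        return fused


if __name__ == "__main__":
    moe = ModalityMoE()
    x = {m: torch.randn(2, 8, 256) for m in ("audio", "vision", "text")}
    print(moe(x).shape)  # torch.Size([2, 24, 256])
```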