Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
- URL: http://arxiv.org/abs/2603.01804v1
- Date: Mon, 02 Mar 2026 12:38:43 GMT
- Title: Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
- Authors: Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu
- Abstract summary: We study the debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. We introduce the first framework that generates a natural non-verbal interaction between Human and AI in real time from 2D body keypoints. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
- Score: 6.623088068354071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask whether contemporary generative models move beyond surface mimicry to participate in the silent but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real time from 2D body keypoints. Our experiments use four lightweight architectures that run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We train on 437 human video clips and demonstrate that pretraining on synthetically generated sequences reduces motion errors significantly without sacrificing speed. Yet a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems such as SORA and VEO, performance drops on SORA-generated clips but degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.
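The abstract describes a perception-action loop that maps observed 2D body keypoints to a responsive pose in real time. The paper's actual architectures are not specified here, so as a minimal sketch (our own illustration, not the authors' method) a constant-velocity baseline shows the keypoints-in / keypoints-out interface such a loop implies; the keypoint count and window length are assumptions:

```python
import numpy as np

N_KEYPOINTS = 17   # e.g. a COCO-style 2D skeleton (assumption)
WINDOW = 8         # frames of observed motion fed to the predictor (assumption)

def predict_next_pose(history: np.ndarray) -> np.ndarray:
    """history: (WINDOW, N_KEYPOINTS, 2) past poses -> (N_KEYPOINTS, 2) next pose.

    Stand-in for a learned model: constant-velocity extrapolation,
    i.e. the last pose plus the last inter-frame displacement.
    """
    assert history.shape == (WINDOW, N_KEYPOINTS, 2)
    velocity = history[-1] - history[-2]
    return history[-1] + velocity

# Simulated loop step: a skeleton drifting right at 1 px/frame
# is extrapolated one frame ahead.
t = np.arange(WINDOW, dtype=float)
history = np.zeros((WINDOW, N_KEYPOINTS, 2))
history[:, :, 0] = t[:, None]        # x advances 1 px per frame
next_pose = predict_next_pose(history)
print(next_pose[0])                  # -> [8. 0.]
```

Any of the paper's lightweight networks would slot in behind the same `(WINDOW, N_KEYPOINTS, 2) -> (N_KEYPOINTS, 2)` signature; a cheap closed-form predictor of this shape is what makes a 100 FPS budget on embedded hardware plausible.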
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
- EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents [85.77432303199176]
We propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via
arXiv Detail & Related papers (2026-02-26T16:53:41Z)
- Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis [51.95817740348585]
Human-X is a novel framework designed to enable immersive and physically plausible human interactions across diverse entities. Our method jointly predicts actions and reactions in real time using an auto-regressive reaction diffusion planner. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction.
arXiv Detail & Related papers (2025-08-04T06:35:48Z)
- ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation [17.438484695828276]
We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis. Our key insight is to distill human-scene interactions from state-of-the-art video generation models. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects.
arXiv Detail & Related papers (2024-12-24T18:55:38Z)
- Scaling Up Dynamic Human-Scene Interaction Modeling [58.032368564071895]
TRUMANS is the most comprehensive motion-captured HSI dataset currently available.
It intricately captures whole-body human motions and part-level object dynamics.
We devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length.
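The TRUMANS summary mentions a diffusion-based autoregressive model that generates sequences of any length. As a toy sketch of that generation pattern only (the learned denoiser is replaced here by a hypothetical stub that pulls noise toward a continuation of the conditioning chunk; dimensions are made up), chunk-by-chunk conditioning is what removes the length limit:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CHUNK = 4, 16   # toy pose dimension and chunk length (assumptions)

def denoise_stub(noisy, context, steps=10):
    """Stand-in for a learned diffusion denoiser: iteratively pulls a
    noisy chunk toward a smooth continuation of the context chunk."""
    target = np.repeat(context[-1:], CHUNK, axis=0)  # hold the last pose
    x = noisy
    for _ in range(steps):
        x = 0.5 * x + 0.5 * target                   # crude refinement step
    return x

def generate(num_chunks):
    """Autoregressive rollout: each new chunk is denoised conditioned on
    the previous one, so sequences of any length can be produced."""
    seq = [np.zeros((CHUNK, DIM))]                   # seed chunk
    for _ in range(num_chunks):
        noise = rng.normal(size=(CHUNK, DIM))
        seq.append(denoise_stub(noise, context=seq[-1]))
    return np.concatenate(seq, axis=0)

motion = generate(3)
print(motion.shape)   # -> (64, 4)
```

The real model would replace `denoise_stub` with a trained network running a proper reverse-diffusion schedule; the outer loop is the part the abstract is claiming.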
arXiv Detail & Related papers (2024-03-13T15:45:04Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
- High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z)
- Learning Whole-Body Human-Robot Haptic Interaction in Social Contexts [11.879852629248981]
This paper presents a learning-from-demonstration (LfD) framework for teaching human-robot social interactions that involve whole-body haptic contact over the full robot body.
The performance of existing LfD frameworks suffers in such interactions due to high dimensionality and data sparsity.
We show that by leveraging this sparsity, we can reduce the data dimensionality without incurring a significant accuracy penalty, and introduce three strategies for doing so.
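The LfD summary above argues that whole-body contact data is high-dimensional but sparse, so dimensionality can be cut with little accuracy loss. As one simple illustration of that idea (our own, not necessarily one of the paper's three strategies; all sizes and the threshold are assumptions), channels that never carry signal in the demonstrations can simply be dropped:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 200, 500                    # frames x sensor channels (assumptions)

# Simulated whole-body contact data: only 20 of 500 channels ever fire.
data = np.zeros((T, D))
active = rng.choice(D, size=20, replace=False)
data[:, active] = rng.normal(size=(T, 20))

# Keep only channels whose signal ever exceeds a small noise threshold.
threshold = 1e-6
keep = np.abs(data).max(axis=0) > threshold
reduced = data[:, keep]

print(D, "->", reduced.shape[1])   # 500 -> 20 channels retained
```

A learned policy fit on `reduced` sees a 25x smaller input here while losing no contact information, which is the trade-off the abstract describes.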
arXiv Detail & Related papers (2020-05-26T03:44:09Z)
- Hyperparameters optimization for Deep Learning based emotion prediction for Human Robot Interaction [0.2549905572365809]
We have proposed an Inception-module-based Convolutional Neural Network architecture.
The model is implemented in real time on the humanoid robot NAO, and its robustness is evaluated.
arXiv Detail & Related papers (2020-01-12T05:25:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.