Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization
- URL: http://arxiv.org/abs/2512.17958v1
- Date: Thu, 18 Dec 2025 08:44:22 GMT
- Title: Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization
- Authors: Farida Mohsen, Ali Safa
- Abstract summary: Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video.
- Score: 0.8839687029212673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a practical multimodal framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video. Unlike prior methods requiring RGB-D sensors or GPU acceleration, our approach runs on resource-constrained embedded hardware (Raspberry Pi 5, CPU-only). To address the severe class imbalance in natural human-robot interaction datasets, we introduce MINT-RVAE (Multimodal Recurrent Variational Autoencoder for Intent Sequence Generation), a novel approach that synthesizes temporally coherent pose-emotion-label sequences for data re-balancing. Comprehensive offline evaluations under cross-subject and cross-scene protocols demonstrate strong generalization performance, achieving frame- and sequence-level AUROC of 0.95. Crucially, we validate real-world generalization through cross-camera evaluation on the MIRA robot head, which employs a different onboard RGB sensor and operates in uncontrolled environments not represented in the training data. Despite this domain shift, the deployed system achieves 91% accuracy and 100% recall across 32 live interaction trials. The close correspondence between offline and deployed performance confirms the cross-sensor and cross-environment robustness of the proposed multimodal approach, highlighting its suitability for ubiquitous multimedia-enabled social robots.
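The abstract does not include reference code, but the MINT-RVAE component can be pictured as a GRU-based variational autoencoder over per-frame pose-emotion-label vectors. Below is a minimal, hypothetical sketch of such a recurrent VAE for synthesizing minority-class intent sequences; the class name, layer sizes, and feature dimension are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a recurrent VAE for synthesizing pose-emotion-label
# sequences (in the spirit of MINT-RVAE); dimensions and names are assumptions.
import torch
import torch.nn as nn

class PoseEmotionRVAE(nn.Module):
    def __init__(self, feat_dim=41, hidden_dim=128, latent_dim=32):
        super().__init__()
        # feat_dim: per-frame vector of 2D keypoints + emotion scores + intent label
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim)
        _, h = self.encoder(x)                                    # h: (1, batch, hidden_dim)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decode(z, x.size(1)), mu, logvar

    def decode(self, z, seq_len):
        h = torch.tanh(self.latent_to_hidden(z)).unsqueeze(0)     # initial decoder state
        step = torch.zeros(z.size(0), 1, self.out.out_features, device=z.device)
        frames = []
        for _ in range(seq_len):                                  # autoregressive frame-by-frame decoding
            out, h = self.decoder(step, h)
            step = self.out(out)
            frames.append(step)
        return torch.cat(frames, dim=1)

def vae_loss(recon, x, mu, logvar, beta=1.0):
    rec = nn.functional.mse_loss(recon, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kld

# After training on real minority-class sequences, synthetic sequences can be
# drawn from the prior to re-balance the dataset, e.g.:
# z = torch.randn(num_samples, 32); synthetic = model.decode(z, seq_len=30)
```

Sequences sampled this way would then be mixed into the training set to counter the class imbalance described above; the exact feature layout and training schedule would follow the paper, not this sketch.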
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. Existing methods, including two-stage methods, tightly couple interaction recognition with a specific detector. We propose a decoupled framework that separates object detection from interaction recognition and leverages multi-modal large language models (MLLMs) for zero-shot interaction recognition.
arXiv Detail & Related papers (2026-02-16T19:01:31Z) - MINT-RVAE: Multi-Cues Intention Prediction of Human-Robot Interaction using Human Pose and Emotion Information from RGB-only Camera Data [0.8839687029212673]
We propose a novel pipeline for predicting human interaction intent with frame-level precision. A key challenge in intent prediction is the class imbalance inherent in real-world HRI datasets. Our approach achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-26T16:49:40Z) - Recognizing Actions from Robotic View for Natural Human-Robot Interaction [52.00935005918032]
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. Existing benchmarks fail to address the unique complexities of N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. We introduce Action from Robotic View, a large-scale dataset for the perception-centric robotic views prevalent in mobile service robots.
arXiv Detail & Related papers (2025-07-30T09:48:34Z) - ARC-Calib: Autonomous Markerless Camera-to-Robot Calibration via Exploratory Robot Motions [15.004750210002152]
ARC-Calib is a model-based markerless camera-to-robot calibration framework. It is fully autonomous and generalizable across diverse robots.
arXiv Detail & Related papers (2025-03-18T20:03:32Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Multimodal Anomaly Detection based on Deep Auto-Encoder for Object Slip Perception of Mobile Manipulation Robots [22.63980025871784]
The proposed framework integrates heterogeneous data streams collected from various robot sensors, including RGB and depth cameras, a microphone, and a force-torque sensor.
The integrated data is used to train a deep autoencoder to construct latent representations of the multisensory data that indicate the normal status.
Anomalies can then be identified by error scores measured as the difference between the latent values of the input data and the latent values of its reconstruction.
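As a rough illustration of that scoring scheme (not the authors' code), the snippet below computes an anomaly score as the distance between the latent code of an input and the latent code of its own reconstruction; the `MultimodalAE` architecture, input dimension, and threshold handling are assumptions.

```python
# Hypothetical illustration of latent-space reconstruction-error scoring;
# the autoencoder architecture and input dimension below are assumptions.
import torch
import torch.nn as nn

class MultimodalAE(nn.Module):
    def __init__(self, in_dim=256, latent_dim=32):
        super().__init__()
        # in_dim: flattened fused features from RGB, depth, audio, and force-torque streams
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def anomaly_score(model, x):
    """Distance between the latent code of x and the latent code of its reconstruction."""
    model.eval()
    with torch.no_grad():
        recon, z_input = model(x)
        z_recon = model.encoder(recon)
        return torch.norm(z_input - z_recon, dim=-1)   # one score per sample

# Usage: flag samples whose score exceeds a threshold calibrated on normal data.
# scores = anomaly_score(model, fused_features); anomalies = scores > threshold
```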
arXiv Detail & Related papers (2024-03-06T09:15:53Z) - EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities compared to standard action recognition in RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
In order to better adopt the VTN for the sparse and fine-grained nature of event data, we design Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
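The abstract does not spell out $\mathcal{L}_{EC}$; purely as a generic stand-in, an InfoNCE-style contrastive loss over embeddings of two augmented views of the same event clip could be written as follows. The temperature and batch handling are assumptions, not the paper's formulation.

```python
# Generic InfoNCE-style contrastive loss over two augmented views of the same
# event clip; this is a stand-in, not the paper's exact L_EC formulation.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    # z_a, z_b: (batch, dim) embeddings of two augmentations of the same clips
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # matching pairs sit on the diagonal; all other pairs act as negatives
    return F.cross_entropy(logits, targets)
```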
arXiv Detail & Related papers (2023-08-25T23:51:07Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - SABER: Data-Driven Motion Planner for Autonomously Navigating Heterogeneous Robots [112.2491765424719]
We present an end-to-end online motion planning framework that uses a data-driven approach to navigate a heterogeneous robot team towards a global goal.
We use stochastic model predictive control (SMPC) to calculate control inputs that satisfy robot dynamics, and consider uncertainty during obstacle avoidance with chance constraints.
Recurrent neural networks are used to provide a quick estimate of future state uncertainty considered in the SMPC finite-time horizon solution.
A Deep Q-learning agent is employed to serve as a high-level path planner, providing the SMPC with target positions that move the robots towards a desired global goal.
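Purely as a toy sketch of that hierarchy, the snippet below has a stand-in high-level planner propose a subgoal and a sampling-based receding-horizon controller track it with an uncertainty-inflated obstacle margin; the candidate-velocity grid and the 3-sigma inflation are simplifications, not the paper's SMPC or deep Q-learning formulation.

```python
# Toy sketch of a hierarchical planner: a high-level policy proposes a subgoal and a
# sampling-based receding-horizon controller tracks it while keeping an
# uncertainty-inflated clearance from obstacles. Not the paper's SMPC + DQN method.
import numpy as np

def high_level_subgoal(state, goal, step=1.0):
    """Stand-in for the learned high-level planner: step the subgoal toward the goal."""
    direction = goal - state
    dist = np.linalg.norm(direction)
    return goal if dist < step else state + step * direction / dist

def low_level_control(state, subgoal, obstacles, sigma, horizon=5, dt=0.1):
    """Pick the constant velocity command whose rollout best approaches the subgoal
    while keeping an uncertainty-inflated margin (3 * sigma) from every obstacle."""
    best_cmd, best_cost = np.zeros(2), np.inf
    for vx in np.linspace(-1.0, 1.0, 9):
        for vy in np.linspace(-1.0, 1.0, 9):
            cmd = np.array([vx, vy])
            pos, cost, feasible = state.copy(), 0.0, True
            for _ in range(horizon):
                pos = pos + cmd * dt
                cost += np.linalg.norm(pos - subgoal)
                for center, radius in obstacles:
                    if np.linalg.norm(pos - center) < radius + 3.0 * sigma:
                        feasible = False          # violates the inflated safety margin
            if feasible and cost < best_cost:
                best_cmd, best_cost = cmd, cost
    return best_cmd

# One planning step for a point robot at the origin:
state, goal = np.zeros(2), np.array([5.0, 0.0])
obstacles = [(np.array([2.0, 0.2]), 0.5)]         # (center, radius)
cmd = low_level_control(state, high_level_subgoal(state, goal), obstacles, sigma=0.1)
```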
arXiv Detail & Related papers (2021-08-03T02:56:21Z) - Enhancing Multi-Robot Perception via Learned Data Association [37.866254392010454]
We address the multi-robot collaborative perception problem, specifically in the context of multi-view infilling for distributed semantic segmentation.
We propose the Multi-Agent Infilling Network: a neural architecture that can be deployed to each agent in a robotic swarm.
Specifically, each robot is in charge of locally encoding and decoding visual information, and a neural mechanism allows for an uncertainty-aware and context-based exchange of intermediate features.
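Only the high-level idea is given here; the following hypothetical two-agent sketch has each agent encode its own view, exchange the compact feature map, and decode from both its own and the peer's features. The layers and the simple concatenation-based fusion are assumptions, not the paper's uncertainty-aware mechanism.

```python
# Hypothetical two-agent sketch of local encoding/decoding with exchange of
# intermediate feature maps; the layers and concatenation fusion are assumptions.
import torch
import torch.nn as nn

class Agent(nn.Module):
    def __init__(self, channels=3, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU())
        # the decoder consumes the agent's own features concatenated with the peer's
        self.decoder = nn.Sequential(nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(feat, channels, 3, padding=1))

    def encode(self, x):
        return self.encoder(x)

    def decode(self, own_feat, peer_feat):
        return self.decoder(torch.cat([own_feat, peer_feat], dim=1))

# Each robot encodes locally, transmits only the compact feature map, and decodes
# using both its own and the received features:
a, b = Agent(), Agent()
view_a, view_b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
feat_a, feat_b = a.encode(view_a), b.encode(view_b)
out_a = a.decode(feat_a, feat_b)   # agent A's infilled output uses B's features
```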
arXiv Detail & Related papers (2021-07-01T22:45:26Z) - Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations in multi-modal data towards recognizing gestures.
Results show that our approach recovers the performance with large improvement gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.