An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
- URL: http://arxiv.org/abs/2602.20219v1
- Date: Mon, 23 Feb 2026 09:05:15 GMT
- Title: An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
- Authors: Guanting Shen, Zi Tian
- Abstract summary: This work presents a novel HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%, highlighting both the robustness and adaptability of the system. Beyond its current performance, the proposed architecture serves as a flexible and extensible foundation for future HRI research, offering a practical pathway toward more sophisticated and natural human-robot collaboration through tightly coupled speech and vision-language processing.
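The abstract names the components (Whisper, Llama 3.1, Florence-2, fuzzy logic, a Dobot Magician arm) but not how they are wired together. The sketch below is one plausible arrangement, not the authors' implementation: only the Whisper calls use a real API (openai-whisper), while `parse_command`, `detect_objects`, `fuzzy_confidence`, and the detection field names are hypothetical placeholders.

```python
# Minimal sketch of the speech -> language -> vision -> control pipeline
# the abstract describes. Only the Whisper calls reflect a real API
# (openai-whisper); parse_command, detect_objects, fuzzy_confidence, and
# the bounding-box field names are illustrative assumptions.

import whisper  # pip install openai-whisper


def transcribe(audio_path: str) -> str:
    """Speech recognition stage (Whisper)."""
    model = whisper.load_model("base")  # small model suits consumer hardware
    return model.transcribe(audio_path)["text"]


def parse_command(text: str) -> dict:
    """NLU stage: the paper uses Llama 3.1 to map free-form speech to a
    structured action, e.g. "pick up the red cube" ->
    {"action": "pick", "target": "red cube"}. Stubbed here."""
    raise NotImplementedError("prompt an LLM for structured output")


def detect_objects(image_path: str) -> list[dict]:
    """Perception stage: Florence-2 object detection would return entries
    like {"label": ..., "score": ..., "bbox": ..., "area": ...}. Stubbed."""
    raise NotImplementedError("run Florence-2 object detection")


def fuzzy_confidence(score: float, area: float) -> float:
    """Toy stand-in for the fuzzy-logic layer: blend detector confidence
    with normalized object size into one grasp-confidence value in [0, 1]."""
    size_term = min(area / 10_000.0, 1.0)  # assumed pixel-area normalization
    return 0.7 * score + 0.3 * size_term


def run_pipeline(audio_path: str, image_path: str) -> None:
    command = parse_command(transcribe(audio_path))
    candidates = [d for d in detect_objects(image_path)
                  if command["target"] in d["label"]]
    if candidates:
        target = max(candidates,
                     key=lambda d: fuzzy_confidence(d["score"], d["area"]))
        # A real system would convert target["bbox"] to Dobot Magician
        # workspace coordinates before issuing the grasp command.
        print(f"{command['action']} at {target['bbox']}")
```

A production fuzzy layer would use proper membership functions and rule evaluation (e.g. via scikit-fuzzy) rather than the weighted blend above, but the overall control flow, speech to structured command to grounded detection to actuation, is the part the abstract emphasizes.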
Related papers
- SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation [1.4175612723267692]
We present the first sign language-driven Vision-Language-Action (VLA) framework for intuitive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm. We focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control.
arXiv Detail & Related papers (2026-02-26T01:16:27Z) - Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning. We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts. GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z) - Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents [19.868403110796105]
We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users.
arXiv Detail & Related papers (2025-09-15T23:15:15Z) - Interpretable Robot Control via Structured Behavior Trees and Large Language Models [0.14990005092937678]
This paper presents a novel framework that bridges natural language understanding and robotic execution. The proposed approach is practical in real-world scenarios, with an average cognition-to-execution accuracy of approximately 94%.
arXiv Detail & Related papers (2025-08-13T08:53:13Z) - Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z) - Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models [50.19518681574399]
A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models. We show that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.
arXiv Detail & Related papers (2025-06-17T17:06:43Z) - Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z) - Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task [17.190635800969456]
In this paper, we examine the use of Large Language Models to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions.
arXiv Detail & Related papers (2024-04-12T12:15:14Z) - Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z) - Proactive Human-Robot Interaction using Visuo-Lingual Transformers [0.0]
Humans possess the innate ability to extract latent visuo-lingual cues to infer context through human interaction.
We propose a learning-based method that uses visual cues from the scene, lingual commands from a user, and knowledge of prior object-object interactions to identify and proactively predict the underlying goal the user intends to achieve.
arXiv Detail & Related papers (2023-10-04T00:50:21Z) - WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with an existing visual grounding and robotic grasping system.
We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)