Talking to Robots: A Practical Examination of Speech Foundation Models for HRI Applications
- URL: http://arxiv.org/abs/2508.17753v1
- Date: Mon, 25 Aug 2025 07:45:20 GMT
- Title: Talking to Robots: A Practical Examination of Speech Foundation Models for HRI Applications
- Authors: Theresa Pekarek Rosin, Julia Gachot, Henri-Leon Kordt, Matthias Kerzel, Stefan Wermter
- Abstract summary: In human-robot interaction (HRI), these challenges intersect to create a uniquely challenging recognition environment. We evaluate four state-of-the-art ASR systems on eight publicly available datasets that capture six dimensions of difficulty. Our analysis demonstrates significant variations in performance, hallucination tendencies, and inherent biases, despite similar scores on standard benchmarks.
- Score: 7.943770437477042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems in real-world settings need to handle imperfect audio, often degraded by hardware limitations or environmental noise, while accommodating diverse user groups. In human-robot interaction (HRI), these challenges intersect to create a uniquely challenging recognition environment. We evaluate four state-of-the-art ASR systems on eight publicly available datasets that capture six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. Our analysis demonstrates significant variations in performance, hallucination tendencies, and inherent biases, despite similar scores on standard benchmarks. These limitations have serious implications for HRI, where recognition errors can interfere with task performance, user trust, and safety.
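The evaluation described above reduces, per dataset and per ASR system, to scoring hypothesis transcripts against references and tracking hallucination-prone outputs. Below is a minimal sketch of such a loop, assuming a generic `transcribe(audio_path)` callable and `(audio_path, reference)` sample pairs; the length-ratio hallucination heuristic is an illustrative assumption, not the paper's actual metric.

```python
# Sketch of a per-dataset ASR evaluation: word error rate (WER) via word-level
# Levenshtein distance, plus a crude length-ratio flag for hallucination-like
# outputs. transcribe() and the sample format are placeholders.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def looks_hallucinated(reference: str, hypothesis: str, ratio: float = 2.0) -> bool:
    """Heuristic: flag hypotheses much longer than their reference."""
    return len(hypothesis.split()) > ratio * max(len(reference.split()), 1)


def evaluate(transcribe, samples):
    """samples: iterable of (audio_path, reference_transcript) pairs."""
    wers, flagged = [], 0
    for audio_path, reference in samples:
        hypothesis = transcribe(audio_path)  # any ASR system under test
        wers.append(word_error_rate(reference, hypothesis))
        flagged += looks_hallucinated(reference, hypothesis)
    return sum(wers) / len(wers), flagged / len(wers)
```

Running `evaluate` once per (system, dataset) pair yields a grid of average WER and hallucination rates from which cross-dimension comparisons like those in the abstract can be read off.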
Related papers
- SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions [54.34001921326444]
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive benchmark suite for speaker verification that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks.
arXiv Detail & Related papers (2025-09-21T14:11:16Z) - Moravec's Paradox: Towards an Auditory Turing Test [0.0]
This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. We introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models, including GPT-4's audio capabilities and OpenAI's Whisper, reveals a striking failure rate exceeding 93%.
arXiv Detail & Related papers (2025-07-30T20:45:13Z) - Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z) - BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition [0.5224038339798622]
We present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. We provide initial benchmarks for ASR and SER tasks, and find that ASR performance degrades as both distance and shout level increase, with varied performance depending on the intended emotion.
arXiv Detail & Related papers (2025-04-30T14:08:14Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech [0.0]
Speech recognition systems fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations.
This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark.
Results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions.
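A simple way to quantify the kind of bias reported here is to score matched fluent and disfluent recordings of the same utterances with one ASR system and compare the paired word error rates. The sketch below uses the `jiwer` package for WER and a paired t-test from SciPy; the `transcribe` callable and the data layout are assumptions, not the study's protocol.

```python
# Hedged sketch: paired WER comparison on fluent vs. disfluent versions of
# the same utterances. transcribe() and the pair format are placeholders.
import jiwer
from scipy import stats

def wer_gap(transcribe, pairs):
    """pairs: iterable of (reference, fluent_audio, disfluent_audio)."""
    fluent_wers, disfluent_wers = [], []
    for reference, fluent_audio, disfluent_audio in pairs:
        fluent_wers.append(jiwer.wer(reference, transcribe(fluent_audio)))
        disfluent_wers.append(jiwer.wer(reference, transcribe(disfluent_audio)))
    # A positive gap means the system recognises disfluent speech less accurately.
    gap = (sum(disfluent_wers) - sum(fluent_wers)) / len(fluent_wers)
    _, p_value = stats.ttest_rel(disfluent_wers, fluent_wers)
    return gap, p_value
```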
arXiv Detail & Related papers (2024-05-10T00:16:58Z) - Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning [2.7706924578324665]
This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data.
We develop a framework, called Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification.
Our model demonstrated superior performance over traditional machine learning approaches, with an accuracy of 70.48% and an F1 score of 59.23%.
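As a rough illustration of the kind of combined objective described above (not the paper's actual SALR formulation), one can pair a cross-entropy severity classifier with a supervised contrastive term over utterance embeddings, so that same-severity utterances cluster regardless of speaker; all names, dimensions, and weights below are assumptions.

```python
# Hedged sketch: classification loss plus a supervised contrastive term over
# utterance embeddings, keyed on severity labels rather than speaker identity.
import torch
import torch.nn.functional as F

def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Pull same-severity embeddings together, push different-severity ones apart."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                       # pairwise similarities
    positives = labels.unsqueeze(0) == labels.unsqueeze(1)
    positives.fill_diagonal_(False)                     # exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # mask self-similarity
    log_prob = F.log_softmax(logits, dim=1)
    pos_counts = positives.sum(1).clamp(min=1)
    return -(log_prob * positives).sum(1).div(pos_counts).mean()

def training_loss(embeddings, logits, severity_labels, alpha=0.5):
    """Multi-task style objective: severity classification + contrastive regulariser."""
    classification = F.cross_entropy(logits, severity_labels)
    contrastive = supervised_contrastive(embeddings, severity_labels)
    return classification + alpha * contrastive
```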
arXiv Detail & Related papers (2024-02-29T18:30:52Z) - The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios [61.74042680711718]
We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge.
This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices.
The goal is for participants to devise a single system that can generalize across different array geometries.
arXiv Detail & Related papers (2023-06-23T18:49:20Z) - I am Only Happy When There is Light: The Impact of Environmental Changes on Affective Facial Expressions Recognition [65.69256728493015]
We study the impact of different image conditions on the recognition of arousal from human facial expressions.
Our results show how the interpretation of human affective states can differ greatly in either the positive or negative direction.
arXiv Detail & Related papers (2022-10-28T16:28:26Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data-intensive, deep neural network (DNN) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.