Moravec's Paradox: Towards an Auditory Turing Test
- URL: http://arxiv.org/abs/2507.23091v1
- Date: Wed, 30 Jul 2025 20:45:13 GMT
- Title: Moravec's Paradox: Towards an Auditory Turing Test
- Authors: David Noever, Forrest McKee
- Abstract summary: This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. We introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models including GPT-4's audio capabilities and OpenAI's Whisper reveals a striking failure rate exceeding 93%.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This research demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. Drawing inspiration from Moravec's paradox (i.e., tasks simple for humans often prove difficult for machines, and vice versa), we introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models, including GPT-4's audio capabilities and OpenAI's Whisper, reveals a striking failure rate exceeding 93%: even the best-performing model achieves only 6.9% accuracy on tasks that humans solve with 7.5 times greater success (52%). These results expose fundamental failures in how AI systems process complex auditory scenes, particularly in selective attention, noise robustness, and contextual adaptation. Our benchmark not only quantifies the human-machine auditory gap but also provides insight into why these failures occur, suggesting that current architectures lack fundamental mechanisms for human-like auditory scene analysis. The traditional design of audio CAPTCHAs exploits perceptual filters that humans have evolved but that multimodal language models fail to apply. This work establishes a diagnostic framework for measuring progress toward human-level machine listening and highlights the need for novel approaches that integrate selective attention, physics-based audio understanding, and context-aware perception into multimodal AI systems.
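As a rough illustration of the headline numbers, the sketch below scores per-category accuracy and reproduces the reported human-machine gap (52% / 6.9% ≈ 7.5). The challenge records and model interface are hypothetical placeholders, since the benchmark's file layout is not specified here.

```python
# Minimal sketch of scoring an auditory benchmark; the challenge records
# and the predict() interface are hypothetical, not the paper's code.
from collections import defaultdict

CATEGORIES = [
    "overlapping_speech", "speech_in_noise", "temporal_distortion",
    "spatial_audio", "coffee_shop_noise", "phone_distortion",
    "perceptual_illusions",
]

def score(challenges, predict):
    """challenges: iterable of (category, audio_path, expected_answer);
    predict: callable mapping an audio path to a model answer string."""
    correct, total = defaultdict(int), defaultdict(int)
    for category, audio_path, expected in challenges:
        total[category] += 1
        if predict(audio_path).strip().lower() == expected.strip().lower():
            correct[category] += 1
    return {c: correct[c] / total[c] for c in total}

# Headline gap reported in the abstract: humans 52% vs. best model 6.9%.
human_acc, best_model_acc = 0.52, 0.069
print(f"human/machine ratio: {human_acc / best_model_acc:.1f}x")  # ~7.5x
```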
Related papers
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of a Large Audio Model (LAM) as a judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification, and speech quality, as well as system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences.
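A system-level correlation like the 0.91 Spearman figure can be computed as below; the judge and human scores are invented placeholders for illustration only.

```python
# Sketch: rank-correlating ensemble-judge scores with human preference
# scores across speech systems. All numbers are made up for illustration.
from scipy.stats import spearmanr

# One aggregate score per system, e.g. averaged over specialized judges
# for lexical content, speech quality, and paralinguistic features.
judge_scores = [0.62, 0.71, 0.55, 0.83, 0.47]
human_scores = [0.60, 0.75, 0.50, 0.90, 0.45]  # human preference rates

rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```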
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
- Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization [9.726581483075938]
Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. We finetune a state-of-the-art model using a stereo audio-image dataset generated via 3D simulations.
arXiv Detail & Related papers (2025-05-16T13:13:25Z)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation-model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response [14.604998731837595]
We develop PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. PITCH's challenges enhanced machine detection capabilities to an 88.7% AUROC score. We also develop a novel human-AI collaborative system that tags suspicious calls as "Deepfake-likely".
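An AUROC figure like PITCH's 88.7% summarizes how well a detector's scores separate deepfake from genuine calls; the sketch below uses invented labels and scores purely to show the computation.

```python
# Sketch: AUROC for a deepfake-call detector. Labels and scores are
# invented placeholders, not data from the PITCH paper.
from sklearn.metrics import roc_auc_score

# 1 = deepfake call, 0 = genuine call; scores = detector's suspicion.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.91, 0.74, 0.32, 0.66, 0.45, 0.12, 0.88, 0.51]

print(f"AUROC = {roc_auc_score(labels, scores):.3f}")
```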
arXiv Detail & Related papers (2024-02-28T06:17:55Z)
- Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
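As a schematic of that recipe (offline imitation on demonstrations, then online finetuning from interventions), a minimal loop might look like the following; the policy, demonstration, and environment objects are hypothetical placeholders, not the paper's system.

```python
# Sketch: offline behavior cloning, then online finetuning from
# corrections. Policy/env/demo handles are hypothetical placeholders.
import torch
import torch.nn as nn

def offline_imitation(policy, demos, epochs=10, lr=1e-3):
    """demos: list of (observation, expert_action) tensor pairs."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, expert_action in demos:
            loss = nn.functional.mse_loss(policy(obs), expert_action)
            opt.zero_grad(); loss.backward(); opt.step()

def online_finetune(policy, env, corrections, steps=1000):
    """corrections: callable returning an intervention action or None."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)
        fix = corrections(obs, action)  # online human intervention
        if fix is not None:             # learn from the correction
            loss = nn.functional.mse_loss(action, fix)
            opt.zero_grad(); loss.backward(); opt.step()
        obs, _, done, _ = env.step(action.detach())
        if done:
            obs = env.reset()
```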
arXiv Detail & Related papers (2022-05-30T04:52:58Z)
- SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech [10.354590276508283]
Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices.
Existing speaker verification techniques distinguish individual speakers via the spectrographic features extracted from an audible frequency range of voice commands.
We propose a speaker verification system, SUPERVOICE, that uses a two-stream architecture with a feature fusion mechanism to generate distinctive speaker models.
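The two-stream idea can be pictured as below: one encoder for the audible band and one for the ultrasound band, fused into a single speaker embedding. The layer sizes and fusion-by-concatenation choice are assumptions for illustration, not the paper's architecture.

```python
# Sketch of a two-stream speaker encoder with feature fusion.
# Dimensions and the concatenation fusion are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamSpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=40, embed_dim=128):
        super().__init__()
        # Stream 1: features from the audible frequency range.
        self.audible = nn.GRU(feat_dim, 64, batch_first=True)
        # Stream 2: features from the ultrasound band of speech energy.
        self.ultrasound = nn.GRU(feat_dim, 64, batch_first=True)
        # Fusion: concatenate final states, project to speaker embedding.
        self.fuse = nn.Linear(64 + 64, embed_dim)

    def forward(self, audible_feats, ultra_feats):
        _, h_a = self.audible(audible_feats)    # (1, B, 64)
        _, h_u = self.ultrasound(ultra_feats)   # (1, B, 64)
        fused = torch.cat([h_a[-1], h_u[-1]], dim=-1)
        return nn.functional.normalize(self.fuse(fused), dim=-1)

# Verification: cosine similarity between enrollment and test embeddings.
enc = TwoStreamSpeakerEncoder()
a = enc(torch.randn(1, 100, 40), torch.randn(1, 100, 40))
b = enc(torch.randn(1, 100, 40), torch.randn(1, 100, 40))
print(torch.sum(a * b, dim=-1))  # similarity score
```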
arXiv Detail & Related papers (2022-05-28T18:00:50Z)
- Nonverbal Sound Detection for Disordered Speech [24.636175845214822]
We introduce an alternative voice-based input system which relies on sound event detection using fifteen non-verbal mouth sounds.
This system was designed to work regardless of one's speech abilities and allows full access to existing technology.
arXiv Detail & Related papers (2022-02-15T22:02:58Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automated speech scoring (ASS), called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
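One way to realize this idea: embed a candidate's other responses, pool them into a speaker context vector, and concatenate it with the encoding of the response being scored. The mean-pooling and dimensions below are illustrative assumptions, not the paper's exact design.

```python
# Sketch of speaker-conditioned scoring: other responses from the same
# candidate are pooled into a context vector. Dimensions are assumptions.
import torch
import torch.nn as nn

class SpeakerConditionedScorer(nn.Module):
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.scorer = nn.Linear(hidden * 2, 1)  # response + speaker context

    def encode(self, feats):
        _, h = self.encoder(feats)  # h: (1, B, hidden)
        return h[-1]                # (B, hidden)

    def forward(self, response_feats, other_responses):
        resp = self.encode(response_feats)
        # Speaker context: mean-pool encodings of the other responses.
        ctx = torch.stack([self.encode(r) for r in other_responses]).mean(0)
        return self.scorer(torch.cat([resp, ctx], dim=-1))  # proficiency score

scorer = SpeakerConditionedScorer()
score = scorer(torch.randn(1, 200, 80), [torch.randn(1, 180, 80)] * 3)
print(score.shape)  # torch.Size([1, 1])
```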
arXiv Detail & Related papers (2021-08-30T07:00:28Z)