SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech
- URL: http://arxiv.org/abs/2205.14496v1
- Date: Sat, 28 May 2022 18:00:50 GMT
- Title: SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech
- Authors: Hanqing Guo, Qiben Yan, Nikolay Ivanov, Ying Zhu, Li Xiao, Eric J. Hunter
- Abstract summary: Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices.
Existing speaker verification techniques distinguish individual speakers via the spectrographic features extracted from an audible frequency range of voice commands.
We propose SUPERVOICE, a speaker verification system that uses a two-stream architecture with a feature fusion mechanism to generate distinctive speaker models.
- Score: 10.354590276508283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices. However, voice spoofing attacks, such as impersonation and replay attacks, in which malicious attackers synthesize the voice of a victim or simply replay it, have brought growing security concerns. Existing speaker verification techniques distinguish individual speakers via the spectrographic features extracted from an audible frequency range of voice commands. However, they often have high error rates and/or long delays. In this paper, we explore a new direction of human voice research by scrutinizing the unique characteristics of human speech at the ultrasound frequency band. Our research indicates that the high-frequency ultrasound components (e.g., speech fricatives) from 20 to 48 kHz can significantly enhance the security and accuracy of speaker verification. We propose SUPERVOICE, a speaker verification system that uses a two-stream DNN architecture with a feature fusion mechanism to generate distinctive speaker models. To test the system, we create a speech dataset with 12 hours of audio (8,950 voice samples) from 127 participants. In addition, we create a second spoofed voice dataset to evaluate its security. To balance between controlled recordings and real-world applications, the audio recordings are collected in two quiet rooms by 8 different recording devices, including 7 smartphones and an ultrasound microphone. Our evaluation shows that SUPERVOICE achieves a 0.58% equal error rate in the speaker verification task and takes only 120 ms to test an incoming utterance, outperforming all existing speaker verification systems. Moreover, within 91 ms of processing time, SUPERVOICE achieves a 0% equal error rate in detecting replay attacks launched by 5 different loudspeakers.
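To make the abstract concrete, below is a minimal, hypothetical sketch of its three technical ingredients: measuring per-frame energy in the 20-48 kHz ultrasound band (which assumes a sampling rate of at least 96 kHz), a two-stream model that fuses an audible-band stream and an ultrasound-band stream into a single speaker embedding, and the equal error rate used as the headline metric. The function names, feature dimensions, single-layer encoders, and concatenation-based fusion are all illustrative assumptions, not the actual SUPERVOICE architecture.

```python
# Illustrative sketch only: NOT the SUPERVOICE implementation.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def ultrasound_energy(wave, sr=96_000, n_fft=2048, hop=512, band=(20_000, 48_000)):
    """Per-frame log energy in the ultrasound band (assumes sr of at least 96 kHz)."""
    window = np.hanning(n_fft)
    frames = np.stack([wave[i:i + n_fft] * window
                       for i in range(0, len(wave) - n_fft, hop)])
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return np.log(power[:, mask].sum(axis=-1) + 1e-10)


class TwoStreamVerifier(nn.Module):
    """Hypothetical two-stream model: one encoder per band, fused by concatenation."""

    def __init__(self, low_dim=80, high_dim=40, emb_dim=128):
        super().__init__()
        # Real systems would use deep CNN/TDNN encoders over spectral features.
        self.low_stream = nn.Sequential(nn.Linear(low_dim, emb_dim), nn.ReLU())
        self.high_stream = nn.Sequential(nn.Linear(high_dim, emb_dim), nn.ReLU())
        self.fusion = nn.Linear(2 * emb_dim, emb_dim)  # concat-then-project fusion

    def forward(self, low_feats, high_feats):
        fused = torch.cat([self.low_stream(low_feats),
                           self.high_stream(high_feats)], dim=-1)
        return F.normalize(self.fusion(fused), dim=-1)  # unit-norm speaker embedding


def equal_error_rate(genuine, impostor):
    """EER: the operating point where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    rates = [(np.mean(impostor >= t), np.mean(genuine < t)) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2


# Toy usage with random data standing in for real recordings and features.
hf_energy = ultrasound_energy(np.random.randn(96_000))   # 1 s of noise at 96 kHz
model = TwoStreamVerifier()
enroll = model(torch.randn(1, 80), torch.randn(1, 40))   # enrollment embedding
probe = model(torch.randn(1, 80), torch.randn(1, 40))    # test-utterance embedding
score = (enroll * probe).sum(dim=-1)                     # cosine similarity score
```

In a real verification system, each stream would be a trained DNN encoder over richer spectral features rather than a single linear layer, and the cosine score would be compared against a calibrated threshold to accept or reject the claimed speaker.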
Related papers
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech [0.5330251011543498]
We used our dataset of more than 45 hours of audio data from 633 individual male and female speakers.
The highest accuracy we recorded was 85.44%.
arXiv Detail & Related papers (2024-04-18T10:17:20Z)
- Phoneme-Based Proactive Anti-Eavesdropping with Controlled Recording Privilege [26.3587130339825]
We propose a novel phoneme-based noise, built on the idea of informational masking, that can distract both machines and humans.
Our system can reduce the recognition accuracy of recordings to below 50% under all tested speech recognition systems.
arXiv Detail & Related papers (2024-01-28T16:56:56Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
Video input is used consistently in both the mask-based MVDR speech separation and the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech mixtures constructed via simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Nonverbal Sound Detection for Disordered Speech [24.636175845214822]
We introduce an alternative voice-based input system which relies on sound event detection using fifteen non-verbal mouth sounds.
This system was designed to work regardless of one's speech abilities and allows full access to existing technology.
arXiv Detail & Related papers (2022-02-15T22:02:58Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Attack on practical speaker verification system using universal adversarial perturbations [20.38185341318529]
This work shows that, by playing a crafted adversarial perturbation as a separate source while the adversary is speaking, a practical speaker verification system can be made to misjudge the adversary as a target speaker.
A two-step algorithm is proposed to optimize the universal adversarial perturbation so that it is text-independent and has little effect on recognition of the authentication text.
arXiv Detail & Related papers (2021-05-19T09:43:34Z)
- FoolHD: Fooling speaker identification by Highly imperceptible adversarial Disturbances [63.80959552818541]
We propose a white-box steganography-inspired adversarial attack that generates imperceptible perturbations against a speaker identification model.
Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function.
We validate FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility.
arXiv Detail & Related papers (2020-11-17T07:38:26Z)
- Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS) synthesis.
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)