Retrieving Speaker Information from Personalized Acoustic Models for
Speech Recognition
- URL: http://arxiv.org/abs/2111.04194v1
- Date: Sun, 7 Nov 2021 22:17:52 GMT
- Title: Retrieving Speaker Information from Personalized Acoustic Models for
Speech Recognition
- Authors: Salima Mdhaffar, Jean-François Bonastre, Marc Tommasi, Natalia Tomashenko, Yannick Estève
- Abstract summary: We show that it is possible to retrieve not only the speaker's gender but also their identity by exploiting only the weight matrix changes of a neural acoustic model locally adapted to that speaker.
- Score: 5.1229352884025845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread availability of powerful personal devices capable of
collecting their users' voices has opened the opportunity to build
speaker-adapted speech recognition (ASR) systems or to participate in
collaborative learning of ASR. In both cases, personalized acoustic models
(AMs), i.e. AMs fine-tuned with speaker-specific data, can be built. A question
that naturally arises is whether the dissemination of personalized acoustic
models can leak personal information. In this paper, we show that it is
possible to retrieve not only the speaker's gender but also their identity by
exploiting only the weight matrix changes of a neural acoustic model locally
adapted to that speaker. Incidentally, we observe phenomena that may be useful
towards the explainability of deep neural networks in the context of speech
processing. Gender can be identified almost surely using only the first layers,
while speaker verification performs well when using middle-to-upper layers. Our
experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an
accuracy of 95% for gender detection and an equal error rate of 9.07% for a
speaker verification task, exploiting only the weights from personalized models,
which could be exchanged instead of user data.
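To make the attack surface concrete, here is a minimal sketch of the general idea: per-layer weight deltas between the shared acoustic model and a speaker-personalized copy are flattened into feature vectors; early-layer deltas feed a gender classifier, and middle-layer deltas are compared with cosine scoring for speaker verification. This is an illustration under assumptions, not the paper's implementation: the PyTorch state_dict interface, the placeholder layer names (tdnn1, tdnn5, ...), the logistic-regression probe, and the cosine scoring are all choices made for the example.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity


def layer_deltas(base_state, personal_state, layer_prefixes):
    """Flatten the weight changes (personalized - base) of the selected layers."""
    deltas = [
        (personal_state[name] - weights).flatten()
        for name, weights in base_state.items()
        if any(name.startswith(prefix) for prefix in layer_prefixes)
    ]
    return torch.cat(deltas).cpu().numpy()


def train_gender_probe(personal_states, base_state, genders,
                       early_layers=("tdnn1.", "tdnn2.")):
    """Gender identification from early-layer deltas (one vector per speaker model).

    The layer prefixes are placeholders; the paper's HMM/TDNN layer naming differs.
    """
    features = np.stack([layer_deltas(base_state, state, early_layers)
                         for state in personal_states])
    return LogisticRegression(max_iter=1000).fit(features, genders)


def verification_score(base_state, enroll_state, test_state,
                       mid_layers=("tdnn5.", "tdnn6.")):
    """Speaker verification: cosine similarity between two models' weight-delta footprints."""
    enroll = layer_deltas(base_state, enroll_state, mid_layers).reshape(1, -1)
    test = layer_deltas(base_state, test_state, mid_layers).reshape(1, -1)
    return float(cosine_similarity(enroll, test))  # threshold to accept/reject a trial
```

In such a setup, the verification score would be thresholded on held-out trials to measure an equal error rate, mirroring how the paper evaluates the leakage from exchanged model weights.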
Related papers
- Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, achieving average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Speaker Identification using Speech Recognition [0.0]
This research provides a mechanism for identifying a speaker in an audio file based on human voice biometric features such as pitch, amplitude, and frequency.
We propose an unsupervised learning model that can learn speech representations from a limited dataset.
arXiv Detail & Related papers (2022-05-29T13:03:42Z)
- Privacy attacks for automatic speech recognition acoustic models in a federated learning framework [5.1229352884025845]
We propose an approach to analyze information in neural network AMs based on a neural network footprint on the Indicator dataset.
Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can achieve an equal error rate (EER) of 1-2%.
arXiv Detail & Related papers (2021-11-06T02:08:13Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- A Lightweight Speaker Recognition System Using Timbre Properties [0.5708902722746041]
We propose a lightweight text-independent speaker recognition model based on a random forest classifier.
It also introduces new features that are used for both speaker verification and identification tasks.
The prototype uses the seven most actively searched properties: boominess, brightness, depth, hardness, timbre, sharpness, and warmth.
arXiv Detail & Related papers (2020-10-12T07:56:03Z)
- Improving on-device speaker verification using federated learning with privacy [5.321241042620525]
Information on speaker characteristics can be useful as side information for improving speaker recognition accuracy.
This paper investigates how privacy-preserving learning can improve a speaker verification system.
arXiv Detail & Related papers (2020-08-06T13:37:14Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.