Effects of language mismatch in automatic forensic voice comparison
using deep learning embeddings
- URL: http://arxiv.org/abs/2209.12602v1
- Date: Mon, 26 Sep 2022 11:49:37 GMT
- Title: Effects of language mismatch in automatic forensic voice comparison
using deep learning embeddings
- Authors: Dávid Sztahó and Attila Fejes
- Abstract summary: This study investigates whether a model pre-trained on an English corpus can be used on a target low-resource language.
It was found that a model pre-trained on a different language, but on a corpus with a large number of speakers, performs well on samples with a language mismatch.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In forensic voice comparison, speaker embeddings have become
widely popular over the last ten years. Most pretrained speaker embeddings
are trained on English corpora, because these are the most easily accessible.
Language dependency can therefore be an important factor in automatic
forensic voice comparison, especially when the target language is
linguistically very different from the training language. Numerous commercial
systems are available, but their models are mainly trained on a language
(mostly English) other than the target language. For a low-resource language,
developing a corpus for forensic purposes with enough speakers to train deep
learning models is costly. This study investigates whether a model
pre-trained on an English corpus can be used on a target low-resource
language (here, Hungarian) that differs from the language the model was
trained on. Moreover, multiple samples are often not available from the
offender (unknown speaker), so samples are compared pairwise both with and
without speaker enrollment for the suspect (known) speakers. Two corpora
developed specifically for forensic purposes are used, along with a third
intended for traditional speaker verification. Two deep learning based
speaker embedding extraction methods are applied: the x-vector and
ECAPA-TDNN. Speaker verification was evaluated in the likelihood-ratio
framework, comparing language combinations across modeling, LR calibration,
and evaluation. Results were assessed with the minCllr and EER metrics. The
model pre-trained on a different language, but on a corpus with a large
number of speakers, was found to perform well on samples with a language
mismatch. The effects of sample duration and speaking style were also
examined: the longer the questioned sample, the better the performance,
while performance did not differ substantially across speaking styles.
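For illustration, here is a minimal sketch of the pairwise comparison setup described above, using a publicly available ECAPA-TDNN extractor. The SpeechBrain model identifier, file names, and the cosine-similarity scoring step are assumptions made for this sketch, not details from the paper, and a raw similarity score would still need LR calibration before any forensic interpretation.

```python
# Sketch: score one questioned sample against one known sample with a
# pretrained (largely English-trained) ECAPA-TDNN speaker embedding.
# The model source and file paths below are hypothetical.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# ECAPA-TDNN trained on VoxCeleb: the "pre-trained on a different
# language" condition when the casework samples are, e.g., Hungarian.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    """Load a recording, resample to 16 kHz, return its speaker embedding."""
    wav, sr = torchaudio.load(path)                    # [channels, time]
    wav = torchaudio.functional.resample(wav, sr, 16000)
    return encoder.encode_batch(wav[:1]).squeeze()     # [emb_dim]

# Pairwise comparison without suspect enrollment: a single known-speaker
# sample is compared directly with the questioned sample.
score = torch.nn.functional.cosine_similarity(
    embed("questioned.wav"), embed("known.wav"), dim=0)
print(f"raw similarity score: {score.item():.3f}")
```

With enrollment, several known-speaker recordings would instead be pooled (for example by averaging their embeddings) into one reference before scoring.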
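The two reported metrics can likewise be sketched from their standard definitions; the implementation below is not the authors' code, and the synthetic scores are placeholders. minCllr is the same Cllr computed after replacing the calibration with the PAV-optimal one, which is omitted here for brevity.

```python
# Standard-definition sketches of EER and Cllr over calibrated
# log-likelihood ratios (natural log). Synthetic inputs only.
import numpy as np

def eer(tar: np.ndarray, non: np.ndarray) -> float:
    """Equal error rate: the operating point where the false-accept and
    false-reject rates cross, scanned over all observed thresholds."""
    thresholds = np.sort(np.concatenate([tar, non]))
    far = np.array([np.mean(non >= t) for t in thresholds])
    frr = np.array([np.mean(tar < t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

def cllr(tar_llr: np.ndarray, non_llr: np.ndarray) -> float:
    """Log-LR cost: 0.5 * (mean_tar log2(1 + e^-llr)
                           + mean_non log2(1 + e^llr))."""
    c_tar = np.mean(np.log1p(np.exp(-tar_llr))) / np.log(2)
    c_non = np.mean(np.log1p(np.exp(non_llr))) / np.log(2)
    return float(0.5 * (c_tar + c_non))

rng = np.random.default_rng(0)
tar = rng.normal(+2.0, 1.5, 1000)   # hypothetical same-speaker LLRs
non = rng.normal(-2.0, 1.5, 1000)   # hypothetical different-speaker LLRs
print(f"EER  = {eer(tar, non):.3f}")
print(f"Cllr = {cllr(tar, non):.3f}")
```

A system that always outputs LR = 1 has Cllr = 1, so values well below 1 indicate useful, well-calibrated evidence; the gap between Cllr and minCllr isolates the calibration loss.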
Related papers
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models with more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that, while relatively better performance is often observed when languages are sampled more equally, downstream performance is more robust to language imbalance than commonly expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet under-explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System [3.4888132404740797]
We evaluate a state-of-the-art automatic speech recognition model, using unseen data from a corpus with a wide variety of labeled English accents.
We show that there is indeed an accuracy bias in terms of accentual variety, favoring the accents most prevalent in the training corpus.
arXiv Detail & Related papers (2021-05-09T08:24:33Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.