The 2021 NIST Speaker Recognition Evaluation
- URL: http://arxiv.org/abs/2204.10242v1
- Date: Thu, 21 Apr 2022 16:18:52 GMT
- Title: The 2021 NIST Speaker Recognition Evaluation
- Authors: Seyed Omid Sadjadi and Craig Greenberg and Elliot Singer and Lisa Mason and Douglas Reynolds
- Abstract summary: The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996.
This paper presents an overview of SRE21 including the tasks, performance metric, data, evaluation protocol, results and system performance analyses.
- Score: 1.5282767384702267
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the
ongoing evaluation series conducted by the U.S. National Institute of Standards
and Technology (NIST) since 1996. It was the second large-scale multimodal
speaker/person recognition evaluation organized by NIST (the first one being
SRE19). Similar to SRE19, it featured two core evaluation tracks, namely audio
and audio-visual, as well as an optional visual track. In addition to offering
fixed and open training conditions, it also introduced new challenges for the
community, thanks to a new multimodal (i.e., audio, video, and selfie images)
and multilingual (i.e., with multilingual speakers) corpus, termed WeCanTalk,
collected outside North America by the Linguistic Data Consortium (LDC). These
challenges included: 1) trials (target and non-target) with enrollment and test
segments originating from different domains (i.e., telephony versus video), and
2) trials (target and non-target) with enrollment and test segments spoken in
different languages (i.e., cross-lingual trials). This paper presents an
overview of SRE21 including the tasks, performance metric, data, evaluation
protocol, results and system performance analyses. A total of 23 organizations
(forming 15 teams) from academia and industry participated in SRE21 and
submitted 158 valid system outputs. Evaluation results indicate: audio-visual
fusion produces substantial gains in performance over audio-only or visual-only
systems; top-performing speaker and face recognition systems exhibited
comparable performance under the matched domain conditions present in this
evaluation; and the use of complex neural network architectures (e.g., ResNet),
along with angular losses with margin, data augmentation, and long-duration
fine-tuning, contributed to notable performance improvements for the
audio-only speaker recognition task.
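As in earlier SREs, the primary performance metric combines miss and false-alarm rates into a detection cost at specified operating points, and NIST typically reports both the actual cost at the submitted decision threshold and the minimum cost over all thresholds. The sketch below shows the standard NIST-style normalized detection cost computation; the parameter values are illustrative, not the official SRE21 settings from the evaluation plan.

```python
import numpy as np

def detection_cost(scores, labels, threshold, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized NIST-style detection cost at a fixed decision threshold.

    C_Det = C_miss * P_target * P_miss + C_fa * (1 - P_target) * P_fa,
    normalized by the best cost achievable without looking at the scores.
    Parameter values here are illustrative, not the official SRE21 settings.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)      # True = target (same-speaker) trial

    decisions = scores >= threshold              # accept if the score clears the threshold
    p_miss = np.mean(~decisions[labels])         # rejected target trials
    p_fa = np.mean(decisions[~labels])           # accepted non-target trials

    c_det = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_det / c_default

def min_dcf(scores, labels, p_target=0.01):
    """Minimum normalized cost over all thresholds induced by the scores."""
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    return min(detection_cost(scores, labels, t, p_target) for t in thresholds)
```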
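The abstract attributes part of the audio-track gains to angular losses with margin used when training speaker embedding networks. Below is a minimal PyTorch sketch of an additive angular margin (ArcFace-style) softmax of the kind commonly used for this purpose; the class name, margin, and scale are illustrative assumptions, not taken from any SRE21 submission.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginLoss(nn.Module):
    """ArcFace-style softmax with an additive angular margin on the target class.

    Hypothetical example: margin and scale are illustrative defaults; the
    values used by SRE21 participants varied from system to system.
    """

    def __init__(self, embedding_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_speakers, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, speaker_ids):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Penalize the target-speaker logit by adding the angular margin.
        target = F.one_hot(speaker_ids, num_classes=cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, speaker_ids)
```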
Related papers
- TTSDS -- Text-to-Speech Distribution Score [9.380879437204277]
Many recently published Text-to-Speech (TTS) systems produce audio close to real speech.
We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility.
We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score, computed as an unweighted average of these factors, correlates strongly with human evaluations.
arXiv Detail & Related papers (2024-07-17T16:30:27Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- KIT's Multilingual Speech Translation System for IWSLT 2023 [58.5152569458259]
We describe our speech translation system for the multilingual track of IWSLT 2023.
The task requires translation into 10 languages with varying amounts of resources.
Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation.
arXiv Detail & Related papers (2023-06-08T16:13:20Z)
- The 2022 NIST Language Recognition Evaluation [1.3730035576297057]
In 2022, the U.S. National Institute of Standards and Technology (NIST) conducted the latest Language Recognition Evaluation (LRE).
Similar to previous LREs, LRE22 focused on conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) data.
This paper presents an overview of LRE22 and an analysis of system performance over different evaluation conditions.
arXiv Detail & Related papers (2023-02-28T15:05:33Z)
- L2 proficiency assessment using self-supervised speech representations [35.70742768910494]
This work extends an initial analysis of a self-supervised speech-representation-based scheme, which requires no speech recognition, to a large-scale proficiency test.
The performance of the self-supervised, wav2vec 2.0, system is compared to a high performance hand-crafted assessment system and a BERT-based text system.
Though the wav2vec 2.0 based system is found to be sensitive to the nature of the response, it can be configured to yield comparable performance to systems requiring a speech transcription.
arXiv Detail & Related papers (2022-11-16T11:47:20Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system across a broad range of datasets and domains.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- ESPnet-ST IWSLT 2021 Offline Speech Translation System [56.83606198051871]
This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track.
This year we made various efforts on training data, architecture, and audio segmentation.
Our best E2E system combined all the techniques with model ensembling and achieved 31.4 BLEU.
arXiv Detail & Related papers (2021-07-01T17:49:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.