NIST SRE CTS Superset: A large-scale dataset for telephony speaker recognition
- URL: http://arxiv.org/abs/2108.07118v1
- Date: Mon, 16 Aug 2021 14:39:23 GMT
- Title: NIST SRE CTS Superset: A large-scale dataset for telephony speaker recognition
- Authors: Seyed Omid Sadjadi
- Abstract summary: This document provides a brief description of the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) conversational telephone speech (CTS) Superset.
The CTS Superset has been created in an attempt to provide the research community with a large-scale dataset.
It contains a large number of telephony speech segments from more than 6800 speakers with speech durations distributed uniformly in the [10s, 60s] range.
- Score: 2.5403247066589074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This document provides a brief description of the National Institute of
Standards and Technology (NIST) speaker recognition evaluation (SRE)
conversational telephone speech (CTS) Superset. The CTS Superset has been
created in an attempt to provide the research community with a large-scale
dataset along with uniform metadata that can be used to effectively train and
develop telephony (narrowband) speaker recognition systems. It contains a large
number of telephony speech segments from more than 6800 speakers with speech
durations distributed uniformly in the [10s, 60s] range. The segments have been
extracted from the source corpora used to compile prior SRE datasets
(SRE1996-2012), including the Greybeard corpus as well as the Switchboard and
Mixer series collected by the Linguistic Data Consortium (LDC). In addition to
the brief description, we also report speaker recognition results on the NIST
2020 CTS Speaker Recognition Challenge, obtained using a system trained with
the CTS Superset. The results will serve as a reference baseline for the
challenge.
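As a concrete illustration of how such a corpus might be consumed, the sketch below filters a hypothetical metadata table (segment ID, speaker ID, speech duration) down to the [10s, 60s] range described in the abstract and counts unique speakers. The file name and column names are assumptions for illustration, not the actual CTS Superset schema.

```python
import csv
from collections import defaultdict

# Hypothetical metadata file; the real CTS Superset ships its own
# metadata format, so treat these column names as placeholders.
META_PATH = "cts_superset_metadata.csv"  # columns: segment_id,speaker_id,duration_s

def load_segments(meta_path, min_dur=10.0, max_dur=60.0):
    """Collect segments whose speech duration falls in [min_dur, max_dur]."""
    by_speaker = defaultdict(list)
    with open(meta_path, newline="") as f:
        for row in csv.DictReader(f):
            dur = float(row["duration_s"])
            if min_dur <= dur <= max_dur:
                by_speaker[row["speaker_id"]].append(row["segment_id"])
    return by_speaker

if __name__ == "__main__":
    speakers = load_segments(META_PATH)
    print(f"{len(speakers)} speakers, "
          f"{sum(len(v) for v in speakers.values())} segments in range")
```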
Related papers
- Language Modelling for Speaker Diarization in Telephonic Interviews [13.851959980488529]
The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER.
The results of this study confirm that linguistic content can be used effectively for some speaker recognition tasks.
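For readers unfamiliar with the metric, diarization error rate (DER) aggregates three error types over the total reference speech. The helper below is a generic sketch of that formula, not the word-level variant used in the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total speech.

    All arguments are durations (e.g., seconds); at the word level the same
    formula can be applied with word counts instead of durations.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech

# Example: 2 s false alarm, 5 s missed, 3 s confusion over 100 s of speech
print(diarization_error_rate(2.0, 5.0, 3.0, 100.0))  # 0.1 -> 10% DER
```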
arXiv Detail & Related papers (2025-01-28T18:18:04Z)
- Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC [73.23245793460275]
Multi-talker speech recognition faces unique challenges in disentangling and transcribing overlapping speech.
This paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for multi-talker ASR (MTASR).
We propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework.
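For context, the sketch below shows the plain CTC objective as implemented in PyTorch, which is the base that SACTC builds on; the Bayes-risk modification itself is paper-specific and not reproduced here, and all tensor shapes are arbitrary examples.

```python
import torch
import torch.nn as nn

# Standard CTC loss; SACTC reshapes this objective via the Bayes risk
# CTC framework, which is beyond this illustrative snippet.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C = 50, 4, 30          # time steps, batch size, vocabulary (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # stand-in model outputs
targets = torch.randint(1, C, (N, 12))                # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```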
arXiv Detail & Related papers (2024-09-19T01:26:33Z)
- kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech [18.701864254184308]
kNN-TTS is a simple and effective framework for zero-shot multi-speaker text-to-speech.
Our models, trained on transcribed speech from a single speaker, achieve performance comparable to state-of-the-art models.
We also introduce a parameter which enables fine-grained voice morphing.
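The retrieval-plus-interpolation idea can be sketched in a few lines: for each source feature frame, fetch its k nearest neighbours from a target-speaker datastore, average them, and blend with the source frame via a morphing weight. This is an illustrative reimplementation of the general kNN idea, not the authors' code; feature extraction (e.g., from a self-supervised model) is assumed to have happened elsewhere, and all names and shapes are hypothetical.

```python
import numpy as np

def knn_morph(source_feats, target_datastore, k=4, lmbda=1.0):
    """Replace each source frame with the mean of its k nearest target
    frames, then interpolate: lmbda=1.0 -> full conversion, 0.0 -> source."""
    # Pairwise squared Euclidean distances, shape (n_src, n_tgt)
    d = ((source_feats[:, None, :] - target_datastore[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]            # k nearest per source frame
    converted = target_datastore[idx].mean(axis=1)
    return lmbda * converted + (1.0 - lmbda) * source_feats

src = np.random.randn(100, 256)    # 100 source frames, 256-dim features
tgt = np.random.randn(500, 256)    # target-speaker datastore
morphed = knn_morph(src, tgt, k=4, lmbda=0.5)  # halfway voice morph
```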
arXiv Detail & Related papers (2024-08-20T12:09:58Z)
- Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models [0.0]
This paper presents a system for automatic speaker verification.
The primary objective of our model is to extract embeddings from the target speaker's audio.
This information is used in our multivoice TTS pipeline, which is currently under development.
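A typical ASV back-end scores a trial by comparing enrollment and test embeddings, most simply with cosine similarity against a threshold. The sketch below assumes embeddings have already been extracted by some encoder; the dimensionality and threshold are placeholder values, not the paper's settings.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

enroll = np.random.randn(192)   # embedding of the enrolled (target) speaker
test = np.random.randn(192)     # embedding of the test utterance
THRESHOLD = 0.5                 # tuned on development data in practice
print("accept" if cosine_score(enroll, test) >= THRESHOLD else "reject")
```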
arXiv Detail & Related papers (2024-06-27T15:08:51Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of single-input multiple-output (SIMO) models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
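Permutation-invariant training scores the model outputs against every ordering of the reference speakers and back-propagates only the best one. Below is a generic PyTorch sketch of a permutation-invariant cross-entropy, not the paper's exact loss; shapes and class counts are arbitrary examples.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(logits, labels):
    """Permutation-invariant cross-entropy.

    logits: (num_speakers, T, num_classes) model outputs, one stream per speaker
    labels: (num_speakers, T) reference label sequences
    Returns the minimum mean cross-entropy over all speaker permutations.
    """
    n = logits.shape[0]
    losses = []
    for perm in itertools.permutations(range(n)):
        loss = sum(
            F.cross_entropy(logits[out].reshape(-1, logits.shape[-1]),
                            labels[ref].reshape(-1))
            for out, ref in zip(range(n), perm)
        ) / n
        losses.append(loss)
    return torch.stack(losses).min()

logits = torch.randn(2, 100, 3)            # 2 speakers, 100 frames, 3 classes
labels = torch.randint(0, 3, (2, 100))
print(pit_cross_entropy(logits, labels))
```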
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z)
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute compared to i-vectors on the LibriTTS dataset.
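Equal error rate (EER) is the operating point where the false-acceptance and false-rejection rates coincide. A compact way to approximate it from trial scores is sketched below; the synthetic score distributions are for illustration only.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds at the sorted scores and find where
    false-acceptance (FAR) and false-rejection (FRR) rates cross.

    scores: similarity scores, higher = more likely target
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 1000),    # target trials
                         rng.normal(-1.0, 1.0, 1000)])  # impostor trials
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER: {equal_error_rate(scores, labels):.3f}")
```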
arXiv Detail & Related papers (2020-10-21T18:03:16Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
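A discrete speech representation typically means quantizing continuous frame features against a learned codebook. The minimal nearest-neighbour quantizer below illustrates the lookup step only; codebook learning and the TTS encoder-decoder are out of scope, and the sizes are assumed for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    # (T, K) squared distances between frames and codebook vectors
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)              # (T,) discrete token ids

codebook = np.random.randn(512, 64)      # 512 codes, 64-dim (assumed sizes)
frames = np.random.randn(200, 64)        # 200 frames of encoder output
tokens = quantize(frames, codebook)
print(tokens[:10])                       # discrete units fed to the decoder
```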
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.