NIST SRE CTS Superset: A large-scale dataset for telephony speaker
recognition
- URL: http://arxiv.org/abs/2108.07118v1
- Date: Mon, 16 Aug 2021 14:39:23 GMT
- Title: NIST SRE CTS Superset: A large-scale dataset for telephony speaker
recognition
- Authors: Seyed Omid Sadjadi
- Abstract summary: This document provides a brief description of the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) conversational telephone speech (CTS) Superset.
The CTS Superset has been created in an attempt to provide the research community with a large-scale dataset.
It contains a large number of telephony speech segments from more than 6800 speakers with speech durations distributed uniformly in the [10s, 60s] range.
- Score: 2.5403247066589074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This document provides a brief description of the National Institute of
Standards and Technology (NIST) speaker recognition evaluation (SRE)
conversational telephone speech (CTS) Superset. The CTS Superset has been
created in an attempt to provide the research community with a large-scale
dataset along with uniform metadata that can be used to effectively train and
develop telephony (narrowband) speaker recognition systems. It contains a large
number of telephony speech segments from more than 6800 speakers with speech
durations distributed uniformly in the [10s, 60s] range. The segments have been
extracted from the source corpora used to compile prior SRE datasets
(SRE1996-2012), including the Greybeard corpus as well as the Switchboard and
Mixer series collected by the Linguistic Data Consortium (LDC). In addition to
the brief description, we also report speaker recognition results on the NIST
2020 CTS Speaker Recognition Challenge, obtained using a system trained with
the CTS Superset. The results will serve as a reference baseline for the
challenge.
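As a rough illustration of the segment-extraction recipe described above (per-segment metadata plus durations drawn uniformly from the [10s, 60s] range), the sketch below cuts excerpts from longer source recordings and records minimal metadata for each. The field names, file layout, and helper code are assumptions for illustration only, not the official CTS Superset format.

```python
import random
import soundfile as sf  # assumed audio I/O library; any equivalent works

def extract_segments(source_recordings, out_dir, seed=0):
    """Cut one excerpt per source recording, with a target duration drawn
    uniformly from [10, 60] seconds, and collect per-segment metadata.
    Hypothetical sketch; the metadata schema is not the official one."""
    rng = random.Random(seed)
    metadata = []
    for rec in source_recordings:
        audio, sr = sf.read(rec["path"])
        target = int(rng.uniform(10.0, 60.0) * sr)  # target length in samples
        if len(audio) <= target:                    # recording shorter than target
            segment = audio
        else:
            start = rng.randrange(len(audio) - target)
            segment = audio[start:start + target]
        sf.write(f"{out_dir}/{rec['segment_id']}.wav", segment, sr)
        metadata.append({
            "segment_id": rec["segment_id"],    # unique segment identifier
            "speaker_id": rec["speaker_id"],    # one of the 6800+ speakers
            "source_corpus": rec["corpus"],     # e.g. Switchboard, Mixer, Greybeard
            "duration_sec": len(segment) / sr,  # should fall in [10, 60]
        })
    return metadata
```

Drawing the target duration uniformly, rather than keeping whole calls, is one simple way to obtain the flat [10s, 60s] duration profile the abstract describes.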
Related papers
- Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models [0.0]
This paper presents a system for automatic speaker verification.
The primary objective of our model is the extraction of embeddings from the target speaker's audio.
This information is used in our multivoice TTS pipeline, which is currently under development.
arXiv Detail & Related papers (2024-06-27T15:08:51Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- The NIST CTS Speaker Recognition Challenge [1.5282767384702267]
The US National Institute of Standards and Technology (NIST) has been conducting a second iteration of the CTS Challenge since August 2020.
This paper presents an overview of the evaluation and several analyses of system performance for some primary conditions in the CTS Challenge.
arXiv Detail & Related papers (2022-04-21T16:06:27Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions (a minimal sketch of the permutation-invariant idea appears after this list).
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z)
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute relative to i-vectors on the LibriTTS dataset.
arXiv Detail & Related papers (2020-10-21T18:03:16Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data [5.249587285519702]
Cotatron is a transcription-guided speech encoder for speaker-independent linguistic representation.
We train a voice conversion system to reconstruct speech from Cotatron features, similar to previous methods.
Our system can also convert speech from speakers unseen during training, and can use ASR to automate transcription with minimal performance degradation.
arXiv Detail & Related papers (2020-05-07T07:37:31Z)
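The end-to-end diarization entry above mentions permutation-invariant cross-entropy losses for meetings with unknown numbers of speakers. The snippet below is a minimal sketch of that general idea (not the loss from the cited paper): it scores every permutation of the predicted speaker-activity tracks against the reference labels and keeps the best-matching one.

```python
from itertools import permutations
import numpy as np

def pit_binary_cross_entropy(pred, ref, eps=1e-7):
    """Permutation-invariant binary cross-entropy for speaker activities.

    pred: (T, S) array of predicted speech-activity probabilities per speaker slot
    ref:  (T, S) array of 0/1 reference activity labels per speaker
    Returns the loss under the best speaker-slot permutation.
    Illustrative only; not the variable-speaker loss from the cited paper.
    """
    num_slots = pred.shape[1]
    pred = np.clip(pred, eps, 1.0 - eps)
    best = np.inf
    for perm in permutations(range(num_slots)):   # try every slot assignment
        p = pred[:, list(perm)]
        bce = -(ref * np.log(p) + (1 - ref) * np.log(1 - p)).mean()
        best = min(best, bce)
    return best
```

Because the number of permutations grows factorially with the number of speakers, practical systems restrict or approximate this search; handling a variable number of speakers, as in the cited paper, requires machinery beyond this sketch.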
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.