NIST SRE CTS Superset: A large-scale dataset for telephony speaker recognition
- URL: http://arxiv.org/abs/2108.07118v1
- Date: Mon, 16 Aug 2021 14:39:23 GMT
- Title: NIST SRE CTS Superset: A large-scale dataset for telephony speaker recognition
- Authors: Seyed Omid Sadjadi
- Abstract summary: This document provides a brief description of the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) conversational telephone speech (CTS) Superset.
The CTS Superset has been created in an attempt to provide the research community with a large-scale dataset.
It contains a large number of telephony speech segments from more than 6800 speakers with speech durations distributed uniformly in the [10s, 60s] range.
- Score: 2.5403247066589074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This document provides a brief description of the National Institute of
Standards and Technology (NIST) speaker recognition evaluation (SRE)
conversational telephone speech (CTS) Superset. The CTS Superset has been
created in an attempt to provide the research community with a large-scale
dataset along with uniform metadata that can be used to effectively train and
develop telephony (narrowband) speaker recognition systems. It contains a large
number of telephony speech segments from more than 6800 speakers with speech
durations distributed uniformly in the [10s, 60s] range. The segments have been
extracted from the source corpora used to compile prior SRE datasets
(SRE1996-2012), including the Greybeard corpus as well as the Switchboard and
Mixer series collected by the Linguistic Data Consortium (LDC). In addition to
the brief description, we also report speaker recognition results on the NIST
2020 CTS Speaker Recognition Challenge, obtained using a system trained with
the CTS Superset. The results will serve as a reference baseline for the
challenge.
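As a concrete illustration of how such a corpus might be consumed, the sketch below filters a hypothetical metadata table (segment ID, speaker ID, speech duration) down to the [10s, 60s] range described in the abstract and counts unique speakers. The file name and column names are assumptions for illustration, not the actual CTS Superset schema.

```python
import csv
from collections import defaultdict

# Hypothetical metadata file; the real CTS Superset ships its own
# metadata format, so treat these column names as placeholders.
META_PATH = "cts_superset_metadata.csv"  # columns: segment_id,speaker_id,duration_s

def load_segments(meta_path, min_dur=10.0, max_dur=60.0):
    """Collect segments whose speech duration falls in [min_dur, max_dur]."""
    by_speaker = defaultdict(list)
    with open(meta_path, newline="") as f:
        for row in csv.DictReader(f):
            dur = float(row["duration_s"])
            if min_dur <= dur <= max_dur:
                by_speaker[row["speaker_id"]].append(row["segment_id"])
    return by_speaker

if __name__ == "__main__":
    speakers = load_segments(META_PATH)
    print(f"{len(speakers)} speakers, "
          f"{sum(len(v) for v in speakers.values())} segments in range")
```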
Related papers
- Language Modelling for Speaker Diarization in Telephonic Interviews [13.851959980488529]
The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER.
The results of this study confirm that linguistic content can be used effectively for some speaker recognition tasks.
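For readers unfamiliar with the metric, diarization error rate (DER) aggregates three error types over the total reference speech. The helper below is a generic sketch of that formula, not the word-level variant used in the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total speech.

    All arguments are durations (e.g., seconds); at the word level the same
    formula can be applied with word counts instead of durations.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech

# Example: 2 s false alarm, 5 s missed, 3 s confusion over 100 s of speech
print(diarization_error_rate(2.0, 5.0, 3.0, 100.0))  # 0.1 -> 10% DER
```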
arXiv Detail & Related papers (2025-01-28T18:18:04Z)
- Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC [73.23245793460275]
Multi-talker speech recognition faces unique challenges in disentangling and transcribing overlapping speech.
This paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for multi-talker ASR (MTASR).
We propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework.
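For context, the sketch below shows the plain CTC objective as implemented in PyTorch, which is the base that SACTC builds on; the Bayes-risk modification itself is paper-specific and not reproduced here, and all tensor shapes are arbitrary examples.

```python
import torch
import torch.nn as nn

# Standard CTC loss; SACTC reshapes this objective via the Bayes risk
# CTC framework, which is beyond this illustrative snippet.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C = 50, 4, 30          # time steps, batch size, vocabulary (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # stand-in model outputs
targets = torch.randint(1, C, (N, 12))                # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```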
arXiv Detail & Related papers (2024-09-19T01:26:33Z)
- kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech [18.701864254184308]
kNN-TTS is a simple and effective framework for zero-shot multi-speaker text-to-speech.
Our models, trained on transcribed speech from a single speaker, achieve performance comparable to state-of-the-art models.
We also introduce a parameter which enables fine-grained voice morphing.
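The retrieval-plus-interpolation idea can be sketched in a few lines: for each source feature frame, fetch its k nearest neighbours from a target-speaker datastore, average them, and blend with the source frame via a morphing weight. This is an illustrative reimplementation of the general kNN idea, not the authors' code; feature extraction (e.g., from a self-supervised model) is assumed to have happened elsewhere, and all names and shapes are hypothetical.

```python
import numpy as np

def knn_morph(source_feats, target_datastore, k=4, lmbda=1.0):
    """Replace each source frame with the mean of its k nearest target
    frames, then interpolate: lmbda=1.0 -> full conversion, 0.0 -> source."""
    # Pairwise squared Euclidean distances, shape (n_src, n_tgt)
    d = ((source_feats[:, None, :] - target_datastore[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]            # k nearest per source frame
    converted = target_datastore[idx].mean(axis=1)
    return lmbda * converted + (1.0 - lmbda) * source_feats

src = np.random.randn(100, 256)    # 100 source frames, 256-dim features
tgt = np.random.randn(500, 256)    # target-speaker datastore
morphed = knn_morph(src, tgt, k=4, lmbda=0.5)  # halfway voice morph
```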
arXiv Detail & Related papers (2024-08-20T12:09:58Z)
- Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models [0.0]
This paper presents a system for automatic speaker verification.
The primary objective of our model is to extract embeddings from the target speaker's audio.
This information is used in our multivoice TTS pipeline, which is currently under development.
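A typical ASV back-end scores a trial by comparing enrollment and test embeddings, most simply with cosine similarity against a threshold. The sketch below assumes embeddings have already been extracted by some encoder; the dimensionality and threshold are placeholder values, not the paper's settings.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

enroll = np.random.randn(192)   # embedding of the enrolled (target) speaker
test = np.random.randn(192)     # embedding of the test utterance
THRESHOLD = 0.5                 # tuned on development data in practice
print("accept" if cosine_score(enroll, test) >= THRESHOLD else "reject")
```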
arXiv Detail & Related papers (2024-06-27T15:08:51Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of single-input multiple-output (SIMO) models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
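Permutation-invariant training scores the model outputs against every ordering of the reference speakers and back-propagates only the best one. Below is a generic PyTorch sketch of a permutation-invariant cross-entropy, not the paper's exact loss; shapes and class counts are arbitrary examples.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(logits, labels):
    """Permutation-invariant cross-entropy.

    logits: (num_speakers, T, num_classes) model outputs, one stream per speaker
    labels: (num_speakers, T) reference label sequences
    Returns the minimum mean cross-entropy over all speaker permutations.
    """
    n = logits.shape[0]
    losses = []
    for perm in itertools.permutations(range(n)):
        loss = sum(
            F.cross_entropy(logits[out].reshape(-1, logits.shape[-1]),
                            labels[ref].reshape(-1))
            for out, ref in zip(range(n), perm)
        ) / n
        losses.append(loss)
    return torch.stack(losses).min()

logits = torch.randn(2, 100, 3)            # 2 speakers, 100 frames, 3 classes
labels = torch.randint(0, 3, (2, 100))
print(pit_cross_entropy(logits, labels))
```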
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z)
- Learning Speaker Embedding from Text-to-Speech [59.80309164404974]
We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute compared to i-vectors on the LibriTTS dataset.
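Equal error rate (EER) is the operating point where the false-acceptance and false-rejection rates coincide. A compact way to approximate it from trial scores is sketched below; the synthetic score distributions are for illustration only.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds at the sorted scores and find where
    false-acceptance (FAR) and false-rejection (FRR) rates cross.

    scores: similarity scores, higher = more likely target
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 1000),    # target trials
                         rng.normal(-1.0, 1.0, 1000)])  # impostor trials
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER: {equal_error_rate(scores, labels):.3f}")
```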
arXiv Detail & Related papers (2020-10-21T18:03:16Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
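A discrete speech representation typically means quantizing continuous frame features against a learned codebook. The minimal nearest-neighbour quantizer below illustrates the lookup step only; codebook learning and the TTS encoder-decoder are out of scope, and the sizes are assumed for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    # (T, K) squared distances between frames and codebook vectors
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)              # (T,) discrete token ids

codebook = np.random.randn(512, 64)      # 512 codes, 64-dim (assumed sizes)
frames = np.random.randn(200, 64)        # 200 frames of encoder output
tokens = quantize(frames, codebook)
print(tokens[:10])                       # discrete units fed to the decoder
```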
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.