Universal speaker recognition encoders for different speech segments duration
- URL: http://arxiv.org/abs/2210.16231v1
- Date: Fri, 28 Oct 2022 16:06:00 GMT
- Title: Universal speaker recognition encoders for different speech segments duration
- Authors: Sergey Novoselov, Vladimir Volokhov, Galina Lavrentyeva
- Abstract summary: A system trained simultaneously on pooled short and long speech segments does not give optimal verification results.
We describe a simple recipe for training a universal speaker encoder for any selected neural network architecture.
- Score: 7.104489204959814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating universal speaker encoders that are robust to different acoustic
and speech duration conditions is a major challenge today. According to our
observations, systems trained on short speech segments are optimal for short-phrase
speaker verification, while systems trained on long segments are superior for
long-segment verification. A system trained simultaneously on pooled short and long
speech segments does not give optimal verification results and usually degrades for
both short and long segments. This paper addresses the problem of creating universal
speaker encoders for different speech segment durations. We describe a simple recipe
for training a universal speaker encoder for any selected neural network architecture.
According to our evaluation results for wav2vec-TDNN based systems on the NIST SRE and
VoxCeleb1 benchmarks, the proposed universal encoder provides speaker verification
improvements for different enrollment and test speech segment durations. The key
feature of the proposed encoder is that it has the same inference time as the selected
neural network architecture.
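To make the verification setting concrete, the sketch below is a minimal, hypothetical illustration rather than the authors' recipe: a toy wav2vec-TDNN style encoder uses statistics pooling so that enrollment and test segments of very different durations map to fixed-size embeddings, which are then compared with a cosine-similarity score. All class and function names, layer sizes, and durations are assumptions made for illustration only.

```python
# Hypothetical sketch (not the authors' recipe): a toy wav2vec-TDNN style
# speaker encoder whose statistics pooling yields a fixed-size embedding
# regardless of input duration, plus a cosine-similarity verification score.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySpeakerEncoder(nn.Module):
    """Stand-in encoder: waveform -> frame features -> stats pooling -> embedding."""

    def __init__(self, embed_dim: int = 192):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=400, stride=160)  # ~25 ms frames at 16 kHz
        self.tdnn = nn.Conv1d(64, 128, kernel_size=5, dilation=2)
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples); segments of any duration give the same embedding
        # size because mean/std pooling collapses the time axis.
        x = F.relu(self.frontend(wav.unsqueeze(1)))
        x = F.relu(self.tdnn(x))
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)  # statistics pooling
        return F.normalize(self.proj(stats), dim=1)


def verification_score(encoder: nn.Module, enroll: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between embeddings of (possibly mismatched-duration) segments."""
    with torch.no_grad():
        return F.cosine_similarity(encoder(enroll), encoder(test))


if __name__ == "__main__":
    sr = 16000
    encoder = ToySpeakerEncoder().eval()
    long_enroll = torch.randn(1, 30 * sr)  # ~30 s enrollment segment
    short_test = torch.randn(1, 2 * sr)    # ~2 s test phrase
    print(verification_score(encoder, long_enroll, short_test))
```

In such a setup, the duration-robustness question raised in the abstract amounts to whether a single set of encoder weights yields reliable scores for both the 30-second enrollment and the 2-second test crop.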
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z) - Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary
Network [28.661704280484457]
We propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network.
We find that WEEND has the potential to deliver high-quality diarized text.
arXiv Detail & Related papers (2023-09-15T15:48:45Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR [38.79441296832869]
We propose an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion.
We demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
arXiv Detail & Related papers (2022-04-22T15:13:12Z) - Revisiting joint decoding based multi-talker speech recognition with DNN
acoustic model [34.061441900912136]
We argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
arXiv Detail & Related papers (2021-10-31T09:28:04Z) - SpEx: Multi-Scale Time Domain Speaker Extraction Network [89.00319878262005]
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment.
It is common to perform the extraction in the frequency domain and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.
We propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
arXiv Detail & Related papers (2020-04-17T16:13:06Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism (see the attention-mask sketch after this list).
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.