T-vectors: Weakly Supervised Speaker Identification Using Hierarchical
Transformer Model
- URL: http://arxiv.org/abs/2010.16071v1
- Date: Thu, 29 Oct 2020 09:38:17 GMT
- Title: T-vectors: Weakly Supervised Speaker Identification Using Hierarchical
Transformer Model
- Authors: Yanpei Shi, Mingjie Chen, Qiang Huang, Thomas Hain
- Abstract summary: This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem.
The proposed model contains a frame-level encoder and a segment-level encoder, both of which use the transformer encoder block.
- Score: 36.372432408617584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying multiple speakers without knowing where each speaker's
voice occurs in a recording is a challenging task. This paper proposes a
hierarchical network with transformer encoders and a memory mechanism to
address this problem. The proposed model contains a frame-level encoder and a
segment-level encoder, both of which use the transformer encoder block. The
multi-head attention mechanism in the transformer structure can better capture
different speaker properties when the input utterance contains multiple
speakers. The memory mechanism used in the frame-level encoders builds a
recurrent connection that better captures long-term speaker features. The
experiments are conducted on artificial datasets based on the Switchboard
Cellular Part 1 (SWBC) and VoxCeleb1 datasets. In the different data
construction scenarios (Concat and Overlap), the proposed model outperforms
four strong baselines, reaching 13.3% and 10.5% relative improvements over
H-vectors and S-vectors. Using the memory mechanism yields 10.6% and 7.7%
relative improvements over not using it.
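To make the hierarchy concrete, the following is a minimal sketch (in PyTorch) of a frame-level/segment-level encoder stack of the kind described above. It is an illustration under stated assumptions, not the authors' T-vectors implementation: the fixed segment length, mean pooling, sigmoid multi-label output, and all dimensions are assumed, and the recurrent memory connection between frame-level encoders is omitted for brevity.

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Illustrative frame-level + segment-level transformer encoder (not the paper's code)."""
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=2, n_speakers=100):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.frame_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.segment_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.classifier = nn.Linear(d_model, n_speakers)

    def forward(self, x, segment_len=50):
        # x: (batch, n_frames, feat_dim); n_frames is assumed divisible by segment_len
        b, t, _ = x.shape
        h = self.input_proj(x)
        # Frame-level encoding: attend over frames within each segment
        h = h.reshape(b * (t // segment_len), segment_len, -1)
        h = self.frame_encoder(h)
        seg = h.mean(dim=1)                           # one embedding per segment
        # Segment-level encoding: attend across segment embeddings
        seg = self.segment_encoder(seg.reshape(b, t // segment_len, -1))
        utt = seg.mean(dim=1)                         # utterance-level embedding
        return torch.sigmoid(self.classifier(utt))    # per-speaker presence scores

# Hypothetical usage: 2 utterances, 500 frames of 80-dimensional features each
scores = HierarchicalEncoder()(torch.randn(2, 500, 80))
print(scores.shape)  # torch.Size([2, 100])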
Related papers
- How Redundant Is the Transformer Stack in Speech Representation Models? [1.3873323883842132]
Self-supervised speech representation models have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection.
Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning.
We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training.
arXiv Detail & Related papers (2024-09-10T11:00:24Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-End Speaker-Attributed ASR with Transformer [41.7739129773237]
This paper presents an end-to-end speaker-attributed automatic speech recognition system.
It jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio.
arXiv Detail & Related papers (2021-04-05T19:54:15Z)
- Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models meet both requirements by employing a first-pass decoder that quickly emits words and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
arXiv Detail & Related papers (2021-01-27T18:05:22Z)
- Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation [71.54816893482457]
We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST).
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST).
arXiv Detail & Related papers (2020-11-02T04:59:50Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
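As a rough illustration of the self-attention pooling idea in the entry above, the short sketch below computes one attention weight per frame from a single scoring vector and returns the weighted average as a fixed-size embedding; the additive tanh scoring and all names are simplifying assumptions rather than the SAEP paper's exact formulation.

import numpy as np

def self_attentive_pool(frames, w):
    """frames: (n_frames, d) frame-level features; w: (d,) assumed learned scoring vector.
    Returns a (d,) utterance-level embedding as an attention-weighted average."""
    scores = np.tanh(frames) @ w                  # one attention score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over frames
    return weights @ frames                       # weighted average of the frames

# Hypothetical usage: one utterance of 200 frames with 64-dimensional features
rng = np.random.default_rng(0)
print(self_attentive_pool(rng.standard_normal((200, 64)), rng.standard_normal(64)).shape)  # (64,)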
- MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS, FastSpeech) have shown advantages in training and inference efficiency over RNN-based models.
We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
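The attention-based speaker memory in the last entry, like the memory mechanism in T-vectors, involves reading from a table of stored speaker vectors with attention weights. The sketch below shows one such read; the scaled dot-product scoring, shapes, and function name are assumptions, not either paper's exact method.

import numpy as np

def attention_memory_read(query, memory):
    """query: (d,) acoustic summary vector; memory: (n_speakers, d) stored speaker vectors.
    Returns an attention-weighted combination of the memory entries."""
    scores = memory @ query / np.sqrt(query.shape[0])  # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over memory slots
    return weights @ memory                             # weighted read from memory

# Hypothetical usage: a memory of 10 speaker vectors of dimension 64
rng = np.random.default_rng(0)
print(attention_memory_read(rng.standard_normal(64), rng.standard_normal((10, 64))).shape)  # (64,)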