Generation of Speaker Representations Using Heterogeneous Training Batch
Assembly
- URL: http://arxiv.org/abs/2203.16646v1
- Date: Wed, 30 Mar 2022 19:59:05 GMT
- Authors: Yu-Huai Peng, Hung-Shin Lee, Pin-Tuan Huang, Hsin-Min Wang
- Abstract summary: We propose a new CNN-based speaker modeling scheme.
We randomly and synthetically augment the training data into a set of segments.
A soft label is imposed on each segment based on its speaker occupation ratio.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In traditional speaker diarization systems, a well-trained speaker model is a
key component to extract representations from consecutive and partially
overlapping segments in a long speech session. To be more consistent with the
back-end segmentation and clustering, we propose a new CNN-based speaker
modeling scheme, which takes into account the heterogeneity of the speakers in
each training segment and batch. We randomly and synthetically augment the
training data into a set of segments, each of which contains more than one
speaker and some overlapping parts. A soft label is assigned to each segment
based on its speaker occupation ratio, and the standard cross-entropy loss is
used in model training. In this way, the speaker model should have the
ability to generate a geometrically meaningful embedding for each multi-speaker
segment. Experimental results show that our system is superior to the baseline
system using x-vectors in two speaker diarization tasks. In the CALLHOME task
trained on the NIST SRE and Switchboard datasets, our system achieves a
relative reduction of 12.93% in DER. In Track 2 of CHiME-6, our system provides
13.24%, 12.60%, and 5.65% relative reductions in DER, JER, and WER,
respectively.
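The soft-label idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the occupation ratio of a speaker is that speaker's share of the active speech frames in a segment (overlapped frames count once per active speaker), and the function names `occupation_soft_label` and `soft_cross_entropy` are hypothetical.

```python
import numpy as np

def occupation_soft_label(activity):
    """Turn a (frames, speakers) 0/1 activity matrix into a soft label.

    Assumption: the label of a speaker is that speaker's share of all
    active speaker-frames in the segment, so the label sums to 1.
    """
    activity = np.asarray(activity, dtype=float)
    frames_per_speaker = activity.sum(axis=0)
    total = frames_per_speaker.sum()
    if total == 0:
        raise ValueError("segment contains no speech")
    return frames_per_speaker / total

def soft_cross_entropy(logits, soft_labels):
    """Standard cross-entropy against soft targets, averaged over the batch."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(soft_labels * log_probs).sum(axis=-1).mean())

# Toy segment: 10 frames, 4 training speakers.
# Speaker 0 talks in frames 0-5, speaker 2 in frames 4-7 (frames 4-5 overlap).
activity = np.zeros((10, 4))
activity[0:6, 0] = 1
activity[4:8, 2] = 1

label = occupation_soft_label(activity)    # speaker 0: 0.6, speaker 2: 0.4
logits = np.zeros((1, 4))                  # an untrained, uniform prediction
loss = soft_cross_entropy(logits, label[None, :])
```

With uniform logits the loss equals log(4), the entropy of a uniform guess over four speakers; training would push the predicted distribution toward the 0.6/0.4 split.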
Related papers
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
Cross-MixSpeaker.
Network addresses limitations of SIMO models by aggregating cross-speaker representations.
Network is integrated with SOT to leverage both the advantages of SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- High-resolution embedding extractor for speaker diarisation [15.392429990363492]
This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE).
HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success.
Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set.
arXiv Detail & Related papers (2022-11-08T07:41:18Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Multi-scale Speaker Diarization with Dynamic Scale Weighting [14.473173007997751]
We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder.
Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with 3.92% and 1.05% diarization error rates, respectively.
arXiv Detail & Related papers (2022-03-30T01:26:31Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Combination of Deep Speaker Embeddings for Diarisation [9.053645441056256]
This paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with different NN components.
A neural-based single-pass speaker diarisation pipeline is also proposed in this paper.
Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets.
arXiv Detail & Related papers (2020-10-22T20:16:36Z)
- Speaker diarization with session-level speaker embedding refinement using graph neural networks [26.688724154619504]
We present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally.
The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space, in which the different speakers within a single session are better separated.
We show that the clustering performance of the refined speaker embeddings outperforms the original embeddings significantly on both simulated and real meeting data.
arXiv Detail & Related papers (2020-05-22T19:52:51Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)