In search of strong embedding extractors for speaker diarisation
- URL: http://arxiv.org/abs/2210.14682v1
- Date: Wed, 26 Oct 2022 13:00:29 GMT
- Title: In search of strong embedding extractors for speaker diarisation
- Authors: Jee-weon Jung, Hee-Soo Heo, Bong-Jin Lee, Jaesung Huh, Andrew Brown,
Youngki Kwon, Shinji Watanabe, Joon Son Chung
- Abstract summary: We tackle two key problems when adopting EEs for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation; we show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
Second, embedding extractors have not seen utterances containing multiple speakers; we propose two data augmentation techniques that make embedding extractors aware of overlapped speech and speaker-change inputs.
- Score: 49.7017388682077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker embedding extractors (EEs), which map input audio to a speaker
discriminant latent space, are of paramount importance in speaker diarisation.
However, there are several challenges when adopting EEs for diarisation, from
which we tackle two key problems. First, the evaluation is not straightforward
because the features required for better performance differ between speaker
verification and diarisation. We show that better performance on widely adopted
speaker verification evaluation protocols does not lead to better diarisation
performance. Second, embedding extractors have not seen utterances in which
multiple speakers exist. These inputs are inevitably present in speaker
diarisation because of overlapped speech and speaker changes; they degrade the
performance. To mitigate the first problem, we generate speaker verification
evaluation protocols that mimic the diarisation scenario better. We propose two
data augmentation techniques to alleviate the second problem, making embedding
extractors aware of overlapped speech or speaker change input. One technique
generates overlapped speech segments, and the other generates segments where
two speakers utter sequentially. Extensive experimental results using three
state-of-the-art speaker embedding extractors demonstrate that both proposed
approaches are effective.
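As a rough sketch of how the two proposed augmentations might look (this is not the authors' implementation; the function names, gain range, silence gap, and 16 kHz sample rate are all assumptions), the following NumPy code mixes two waveforms to simulate overlapped speech and concatenates them to simulate a speaker change:

```python
import numpy as np

def overlap_augment(wav_a: np.ndarray, wav_b: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Simulate overlapped speech by mixing a second speaker's waveform
    into the first at a random position and gain (both assumptions)."""
    n = min(len(wav_a), len(wav_b))
    out = wav_a.astype(np.float64)
    start = int(rng.integers(0, len(wav_a) - n + 1))  # random overlap onset
    gain = rng.uniform(0.5, 1.0)                      # attenuate the interferer
    out[start:start + n] += gain * wav_b[:n]
    return out

def speaker_change_augment(wav_a: np.ndarray, wav_b: np.ndarray,
                           rng: np.random.Generator,
                           sample_rate: int = 16000) -> np.ndarray:
    """Simulate a speaker change by concatenating two speakers' waveforms,
    separated by a short random silence (the gap length is an assumption)."""
    gap = np.zeros(int(rng.uniform(0.0, 0.2) * sample_rate))
    return np.concatenate([wav_a, gap, wav_b])

# Example usage with random stand-ins for two speakers' audio:
rng = np.random.default_rng(0)
wav_a = rng.standard_normal(16000)  # 1 s of "speech" at 16 kHz
wav_b = rng.standard_normal(12000)
overlapped = overlap_augment(wav_a, wav_b, rng)
changed = speaker_change_augment(wav_a, wav_b, rng)
```

In practice, wav_a and wav_b would be drawn from different speakers in the training corpus, and the augmented segments would be fed to the embedding extractor during training, per the abstract's description.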
Related papers
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
  We introduce speaker labels into an autoregressive transformer-based speech recognition model.
  We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
  With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
  arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
  We propose methods to extract speaker-related information from semantic content in multi-party meetings.
  Experiments on both the AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
  arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
  Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.
  We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
  arXiv Detail & Related papers (2021-10-03T07:05:29Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
  Clustering-based diarization methods partition frames into as many clusters as there are speakers.
  Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
  We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
  arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Leveraging speaker attribute information using multi task learning for speaker verification and diarization [33.60058873783114]
  We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application.
  We show that by leveraging two additional forms of speaker attribute information, we improve the performance of our deep speaker embeddings for both verification and diarization tasks.
  arXiv Detail & Related papers (2020-10-27T13:10:51Z)
- Speaker Separation Using Speaker Inventories and Estimated Speech [78.57067876891253]
  We propose speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES).
  By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches.
  arXiv Detail & Related papers (2020-10-20T18:15:45Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
  A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
  In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely.
  We propose a novel model to fill this gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
  arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Speaker Re-identification with Speaker Dependent Speech Enhancement [37.33388614967888]
  This paper introduces a novel approach that cascades speech enhancement and speaker recognition.
  The proposed approach is evaluated using the VoxCeleb1 dataset, which aims to assess speaker recognition in real-world situations.
  arXiv Detail & Related papers (2020-05-15T23:02:10Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
  SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
  SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
  We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
  arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.