A Real-time Speaker Diarization System Based on Spatial Spectrum
- URL: http://arxiv.org/abs/2107.09321v1
- Date: Tue, 20 Jul 2021 08:25:23 GMT
- Title: A Real-time Speaker Diarization System Based on Spatial Spectrum
- Authors: Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng,
Zhijie Yan
- Abstract summary: We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks.
First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in far-field adverse environments.
Second, an online speaker-location joint clustering approach is proposed to keep track of speaker location.
Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech.
- Score: 14.189768987932364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we describe a speaker diarization system that enables
localization and identification of all speakers present in a conversation or
meeting. We propose a novel systematic approach to tackle several long-standing
challenges in speaker diarization tasks: (1) to segment and separate
overlapping speech from two speakers; (2) to estimate the number of speakers
when participants may enter or leave the conversation at any time; (3) to
provide accurate speaker identification on short text-independent utterances;
(4) to track speakers' movements during the conversation; (5) to detect
speaker changes in real time. First, a differential directional
microphone array-based approach is exploited to capture the target speakers'
voices in far-field adverse environments. Second, an online speaker-location
joint clustering approach is proposed to keep track of speaker location. Third,
an instant speaker number detector is developed to trigger the mechanism that
separates overlapped speech. The results suggest that our system effectively
incorporates spatial information and achieves significant gains.
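To make the online speaker-location joint clustering step concrete, here is a minimal sketch in Python, assuming each speech segment arrives with a speaker embedding and a direction-of-arrival (DOA) estimate taken from the microphone array's spatial spectrum. The assignment rule, thresholds, and names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical thresholds; the paper does not publish these values.
EMB_THRESHOLD = 0.6   # minimum cosine similarity to an existing speaker
DOA_THRESHOLD = 20.0  # maximum angular gap in degrees to an existing speaker


class SpeakerCluster:
    """One tracked speaker: a running embedding plus the last known DOA."""

    def __init__(self, embedding, doa):
        self.embedding = embedding / np.linalg.norm(embedding)
        self.doa = doa

    def update(self, embedding, doa, alpha=0.9):
        emb = embedding / np.linalg.norm(embedding)
        self.embedding = alpha * self.embedding + (1.0 - alpha) * emb
        self.embedding /= np.linalg.norm(self.embedding)
        self.doa = doa  # keep the latest location so movement is tracked


def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def assign_segment(clusters, embedding, doa):
    """Assign one incoming segment online, using voice AND location jointly."""
    emb = embedding / np.linalg.norm(embedding)
    best_idx, best_sim = None, -1.0
    for idx, cluster in enumerate(clusters):
        sim = float(np.dot(cluster.embedding, emb))
        close_in_space = angular_distance(cluster.doa, doa) < DOA_THRESHOLD
        if sim > EMB_THRESHOLD and close_in_space and sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is None:                       # no match: a new speaker appeared
        clusters.append(SpeakerCluster(embedding, doa))
        return len(clusters) - 1
    clusters[best_idx].update(embedding, doa)  # match: refine that speaker
    return best_idx


# Toy usage: two simulated speakers sitting near 30 and 150 degrees.
rng = np.random.default_rng(0)
spk_a, spk_b = rng.normal(size=128), rng.normal(size=128)
clusters = []
for base, doa in [(spk_a, 30.0), (spk_b, 150.0), (spk_a, 32.0), (spk_b, 148.0)]:
    segment_embedding = base + 0.05 * rng.normal(size=128)
    label = assign_segment(clusters, segment_embedding, doa)
    print(f"segment at {doa:5.1f} deg -> speaker {label}")
```

In this sketch the spatial cue acts as a gate on the acoustic similarity, which is one simple way to read "speaker-location joint clustering"; the paper's actual clustering objective may combine the two cues differently.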
Related papers
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
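One common way to realise such a framework is to inject must-link / cannot-link constraints into the segment affinity matrix before clustering; the sketch below illustrates that general idea only and is not taken from the paper (matrix values, threshold, and function names are hypothetical).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def apply_constraints(affinity, must_link, cannot_link):
    """Overwrite affinity entries with semantic pairwise constraints."""
    a = affinity.copy()
    for i, j in must_link:      # semantics say "same speaker"
        a[i, j] = a[j, i] = 1.0
    for i, j in cannot_link:    # semantics say "different speakers"
        a[i, j] = a[j, i] = 0.0
    return a


def cluster_segments(affinity, threshold=0.5):
    """Average-linkage agglomerative clustering on (1 - affinity) distances."""
    dist = 1.0 - affinity
    np.fill_diagonal(dist, 0.0)
    return fcluster(linkage(squareform(dist), method="average"),
                    t=threshold, criterion="distance")


# Toy example: four segments; semantics forbid segments 0 and 3 from merging.
affinity = np.array([[1.0, 0.9, 0.2, 0.6],
                     [0.9, 1.0, 0.3, 0.5],
                     [0.2, 0.3, 1.0, 0.4],
                     [0.6, 0.5, 0.4, 1.0]])
constrained = apply_constraints(affinity, must_link=[], cannot_link=[(0, 3)])
print(cluster_segments(constrained))
```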
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input.
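A minimal sketch of what such augmentations could look like (the paper's exact recipes, gain ranges, and segment lengths are not reproduced here): one function simulates overlapped speech by mixing two speakers at a chosen ratio, the other simulates a speaker change by concatenating them.

```python
import numpy as np


def overlap_augment(wav_a, wav_b, snr_db=0.0):
    """Mix speaker B under speaker A at the given signal-to-interference ratio."""
    n = min(len(wav_a), len(wav_b))
    a, b = wav_a[:n], wav_b[:n]
    power_a = np.mean(a ** 2) + 1e-12
    power_b = np.mean(b ** 2) + 1e-12
    scale = np.sqrt(power_a / (power_b * 10.0 ** (snr_db / 10.0)))
    return a + scale * b            # overlapped-speech training sample


def speaker_change_augment(wav_a, wav_b):
    """Concatenate two speakers so the sample contains a speaker change."""
    return np.concatenate([wav_a, wav_b])


# Toy usage with random 1-second "waveforms" at 16 kHz.
rng = np.random.default_rng(0)
wav_a, wav_b = rng.normal(size=16000), rng.normal(size=16000)
print(overlap_augment(wav_a, wav_b, snr_db=5.0).shape)
print(speaker_change_augment(wav_a, wav_b).shape)
```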
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge [4.022057598291766]
The Royalflush speaker diarization system was submitted to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge.
The system comprises speech enhancement, overlapped speech detection, speaker embedding extraction, speaker clustering, speech separation, and system fusion.
arXiv Detail & Related papers (2022-02-10T03:35:05Z)
- Joint speaker diarisation and tracking in switching state-space model [51.58295550366401]
This paper proposes to explicitly track the movements of speakers while jointly performing diarisation within a unified model.
A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers.
Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
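As a rough illustration of the hidden state this entry describes (a sketch under assumed names, not the paper's formulation), the state can be read as the identity of the currently active speaker together with a predicted location per speaker:

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DiarisationState:
    """Hidden state: who is speaking now, plus predicted (x, y) per speaker."""
    active_speaker: str
    locations: Dict[str, Tuple[float, float]]


# Toy instant of the state for a two-speaker meeting.
state = DiarisationState(
    active_speaker="spk1",
    locations={"spk1": (1.2, 0.4), "spk2": (-0.8, 2.1)},
)
print(state.active_speaker, state.locations["spk2"])
```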
arXiv Detail & Related papers (2021-09-23T04:43:58Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach to speaker diarization that leverages lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
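One simple way to picture this integration is to let word-level turn probabilities reduce the affinity between adjacent segments before clustering; the sketch below shows that idea only and is an assumption, not the paper's exact mechanism.

```python
import numpy as np


def apply_turn_probabilities(affinity, turn_probs):
    """Down-weight affinity between adjacent segments when a turn is likely.

    affinity:   (n, n) symmetric segment-similarity matrix
    turn_probs: turn_probs[i] = P(speaker turn between segment i and i + 1)
    """
    a = affinity.copy()
    for i, p in enumerate(turn_probs):
        a[i, i + 1] *= (1.0 - p)
        a[i + 1, i] = a[i, i + 1]
    return a


# Toy example: a likely turn between segments 0 and 1 weakens their link.
affinity = np.array([[1.0, 0.8, 0.3],
                     [0.8, 1.0, 0.7],
                     [0.3, 0.7, 1.0]])
print(apply_turn_probabilities(affinity, turn_probs=[0.9, 0.1]))
```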
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)