Joint speaker diarisation and tracking in switching state-space model
- URL: http://arxiv.org/abs/2109.11140v1
- Date: Thu, 23 Sep 2021 04:43:58 GMT
- Title: Joint speaker diarisation and tracking in switching state-space model
- Authors: Jeremy H. M. Wong and Yifan Gong
- Abstract summary: This paper proposes to explicitly track the movements of speakers while jointly performing diarisation within a unified model.
A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers.
Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.
- Score: 51.58295550366401
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speakers may move around while diarisation is being performed. When a
microphone array is used, the instantaneous locations from which sounds
originate can be estimated, and previous investigations have shown that
such information can be complementary to speaker embeddings in the diarisation
task. However, these approaches often assume that speakers are fairly
stationary throughout a meeting. This paper relaxes this assumption, by
proposing to explicitly track the movements of speakers while jointly
performing diarisation within a unified model. A state-space model is proposed,
where the hidden state expresses the identity of the current active speaker and
the predicted locations of all speakers. The model is implemented as a particle
filter. Experiments on a Microsoft rich meeting transcription task show that
the proposed joint location tracking and diarisation approach is able to
perform comparably with other methods that use location information.
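To make the idea of a hidden state that combines the active speaker identity with the locations of all speakers more concrete, here is a minimal sketch of a bootstrap particle filter. It is not the authors' model: the one-dimensional angle observations, the Gaussian random-walk motion, the fixed switching probability, and all names and parameters (`particle_filter`, `p_switch`, `move_std`, `obs_std`) are assumptions made for illustration, and speaker embeddings and angle wrap-around are left out entirely.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used as the assumed observation likelihood."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def particle_filter(observations, init_locations, n_particles=500,
                    p_switch=0.1, move_std=1.0, obs_std=5.0, seed=0):
    """Return the most probable active speaker index for each observation.

    Each particle is a pair (active_speaker, locations), i.e. a discrete
    switching variable plus one location (an angle in degrees) per speaker.
    All modelling choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_speakers = len(init_locations)
    particles = [(rng.randrange(n_speakers), list(init_locations))
                 for _ in range(n_particles)]
    decisions = []
    for obs in observations:
        proposed, weights = [], []
        for speaker, locs in particles:
            # Switching step: occasionally hand over to another speaker.
            if n_speakers > 1 and rng.random() < p_switch:
                speaker = rng.choice(
                    [s for s in range(n_speakers) if s != speaker])
            # Motion step: every speaker's location follows a random walk.
            locs = [loc + rng.gauss(0.0, move_std) for loc in locs]
            proposed.append((speaker, locs))
            # Weight by how well the active speaker's location explains obs.
            weights.append(gaussian_pdf(obs, locs[speaker], obs_std) + 1e-300)
        # Approximate posterior over the active speaker from the weights.
        total = sum(weights)
        posterior = [0.0] * n_speakers
        for (speaker, _), w in zip(proposed, weights):
            posterior[speaker] += w / total
        decisions.append(max(range(n_speakers), key=posterior.__getitem__))
        # Resample particles in proportion to their weights.
        particles = rng.choices(proposed, weights=weights, k=n_particles)
    return decisions


if __name__ == "__main__":
    # Two hypothetical speakers near 30 and 120 degrees taking turns.
    obs = [30.0] * 5 + [120.0] * 5 + [30.0] * 5
    print(particle_filter(obs, [30.0, 120.0]))
```

The sketch keeps the discrete speaker variable and the continuous locations in the same particle, so resampling updates both at once, which is the sense in which tracking and diarisation are performed jointly here.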
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Multi-microphone Automatic Speech Segmentation in Meetings Based on
Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
- Diarisation using location tracking with agglomerative clustering [42.13772744221499]
This paper explicitly models the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework.
Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task.
arXiv Detail & Related papers (2021-09-22T08:54:10Z)
- A Real-time Speaker Diarization System Based on Spatial Spectrum [14.189768987932364]
We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks.
First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in adverse far-field environments.
Second, an online speaker-location joint clustering approach is proposed to keep track of speaker location.
Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech.
arXiv Detail & Related papers (2021-07-20T08:25:23Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for
Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker in each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.