A Review of Speaker Diarization: Recent Advances with Deep Learning
- URL: http://arxiv.org/abs/2101.09624v1
- Date: Sun, 24 Jan 2021 01:28:05 GMT
- Title: A Review of Speaker Diarization: Recent Advances with Deep Learning
- Authors: Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji
Watanabe, Shrikanth Narayanan
- Abstract summary: Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity.
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
- Score: 78.20151731627958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization is a task to label audio or video recordings with classes
corresponding to speaker identity, or in short, a task to identify "who spoke
when". In the early years, speaker diarization algorithms were developed for
speech recognition on multi-speaker audio recordings to enable speaker adaptive
processing, but over time the task also gained value as a stand-alone application,
providing speaker-specific meta information for downstream tasks such as
audio retrieval. More recently, with the rise of deep learning technology that
has been a driving force to revolutionary changes in research and practices
across speech application domains in the past decade, more rapid advancements
have been made for speaker diarization. In this paper, we review not only the
historical development of speaker diarization technology but also the recent
advancements in neural speaker diarization approaches. We also discuss how
speaker diarization systems have been integrated with speech recognition
applications, and how the recent surge of deep learning is leading the way toward
jointly modeling these two components so that they complement each other. Given
these exciting technical trends, we believe this survey is a valuable contribution
to the community: it consolidates the recent developments in neural methods and
thereby facilitates further progress toward more efficient speaker diarization.
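The survey's starting point is the conventional pipeline: segment the audio, extract a speaker embedding per segment, cluster the embeddings, and label each segment with its cluster, which yields the "who spoke when" output described above. The snippet below is only a minimal sketch of that pipeline, assuming per-segment embeddings are already available; the segment times and embedding values are synthetic placeholders rather than the output of any real front end.

```python
# Minimal sketch of a clustering-based "who spoke when" pipeline.
# Assumes speaker embeddings were already extracted per speech segment
# (e.g., by an x-vector/d-vector model); everything below is synthetic.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# (start_sec, end_sec) of speech segments from a hypothetical VAD front end.
segments = [(0.0, 1.5), (1.5, 3.0), (3.2, 4.0), (4.0, 5.5), (5.7, 7.0)]

# Toy embeddings: two synthetic "speakers" plus a little noise.
spk_a = rng.normal(0.0, 1.0, 64)
spk_b = rng.normal(3.0, 1.0, 64)
embeddings = np.stack([
    spk_a + rng.normal(0, 0.1, 64),
    spk_a + rng.normal(0, 0.1, 64),
    spk_b + rng.normal(0, 0.1, 64),
    spk_b + rng.normal(0, 0.1, 64),
    spk_a + rng.normal(0, 0.1, 64),
])

# Length-normalize so Euclidean distances behave like cosine distances,
# then cluster the segments into speakers (speaker count assumed known).
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

# Emit RTTM-style "who spoke when" labels.
for (start, end), label in zip(segments, labels):
    print(f"{start:4.1f}s - {end:4.1f}s  speaker_{label}")
```

Real systems swap the synthetic embeddings for a trained speaker-embedding extractor and usually estimate the number of speakers rather than fixing it; the end-to-end neural approaches reviewed in the paper replace this multi-stage pipeline with a single model.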
Related papers
- Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization [25.213694510527436]
Most existing speaker diarization systems rely exclusively on unimodal acoustic information.
We propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization.
Our approach consistently outperforms state-of-the-art speaker diarization methods.
arXiv Detail & Related papers (2024-08-22T03:34:03Z)
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate the resulting pairwise constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network [28.661704280484457]
We propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network.
We find that WEEND has the potential to deliver high-quality diarized text.
arXiv Detail & Related papers (2023-09-15T15:48:45Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as the encoder represent dialogues only coarsely.
We propose a novel model to fill this gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach to speaker diarization that leverages lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities, together with speaker embeddings, into the speaker clustering process to improve overall diarization accuracy (a toy sketch of this kind of fusion appears after this list).
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
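As a rough illustration of the kind of fusion described in the "Speaker Diarization with Lexical Information" entry above, the sketch below mixes word-level speaker-turn probabilities into an acoustic affinity matrix before clustering. The turn probabilities, the damping rule, and all numbers are invented for illustration; this is not the cited paper's actual formulation.

```python
# Hypothetical fusion of ASR-derived speaker-turn cues with speaker
# embeddings in a clustering-based diarizer. All values are synthetic.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)

# Synthetic per-segment speaker embeddings (4 segments, 32 dims).
emb = rng.normal(size=(4, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Acoustic affinity: cosine similarity mapped into [0, 1].
acoustic_aff = (emb @ emb.T + 1.0) / 2.0

# Made-up probability of a speaker turn between consecutive segments,
# as might be predicted from the recognized words around each boundary.
turn_prob = np.array([0.1, 0.9, 0.2])

# Fuse: damp the affinity between adjacent segments when a turn is likely.
fused_aff = acoustic_aff.copy()
for i, p in enumerate(turn_prob):
    fused_aff[i, i + 1] *= 1.0 - p
    fused_aff[i + 1, i] *= 1.0 - p

# Cluster on the fused affinity matrix and print segment-to-speaker labels.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(fused_aff)
print("segment -> speaker:", labels)
```

The intended effect is that lexical evidence of a turn keeps acoustically similar but distinct speakers from being merged across that boundary; how such cues are estimated and weighted is what the cited paper actually studies.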
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.