Collar-aware Training for Streaming Speaker Change Detection in
Broadcast Speech
- URL: http://arxiv.org/abs/2205.07086v1
- Date: Sat, 14 May 2022 15:35:43 GMT
- Authors: Joonas Kalda and Tanel Alumäe
- Abstract summary: We present a novel training method for speaker change detection models.
The proposed method uses an objective function which encourages the model to predict a single positive label within a specified collar.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel training method for speaker change
detection models. Speaker change detection is often viewed as a binary sequence
labelling problem. The main challenges with this approach are the vagueness of
annotated change points caused by the silences between speaker turns and
imbalanced data due to the majority of frames not including a speaker change.
Conventional training methods tackle these by artificially increasing the
proportion of positive labels in the training data. Instead, the proposed
method uses an objective function which encourages the model to predict a
single positive label within a specified collar. This is done by marginalizing
over all possible subsequences that have exactly one positive label within the
collar. Experiments on English and Estonian datasets show large improvements
over the conventional training method. Additionally, the model outputs have
peaks concentrated in a single frame, removing the need for post-processing to
find the exact predicted change point, which is particularly useful for
streaming applications.
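The collar-aware objective described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes per-frame change probabilities (e.g. sigmoid outputs) and computes the negative log-probability that exactly one positive label falls inside the collar around an annotated change point, marginalizing over every frame in the collar. The function name and arguments are hypothetical.

```python
import math

def collar_marginal_nll(probs, change_frame, collar):
    """Negative log-likelihood that exactly one positive label lies
    within `collar` frames of the annotated change point.

    probs        -- per-frame change probabilities in (0, 1)
    change_frame -- annotated change-point frame index
    collar       -- collar radius in frames
    """
    lo = max(0, change_frame - collar)
    hi = min(len(probs), change_frame + collar + 1)
    window = probs[lo:hi]

    # Work in log space for numerical stability.
    log_1mp = [math.log1p(-p) for p in window]
    total = sum(log_1mp)

    # log P(exactly one positive) = logsumexp_i [log p_i + sum_{j != i} log(1 - p_j)]
    terms = [math.log(p) + total - l for p, l in zip(window, log_1mp)]
    m = max(terms)
    log_prob = m + math.log(sum(math.exp(t - m) for t in terms))
    return -log_prob
```

Under this loss, a prediction sharply peaked at one frame inside the collar scores better than a flat one spread across it, which matches the paper's observation that model outputs concentrate in a single frame. Frames outside any collar would still be trained toward zero with a standard binary cross-entropy term.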
Related papers
- Early Stopping Against Label Noise Without Validation Data [54.27621957395026]
We propose a novel early stopping method called Label Wave, which does not require validation data for selecting the desired model.
We show both the effectiveness of the Label Wave method across various settings and its capability to enhance the performance of existing methods for learning with noisy labels.
arXiv Detail & Related papers (2025-02-11T13:40:15Z)
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Attention-based conditioning methods using variable frame rate for style-robust speaker verification [21.607777746331998]
We propose an approach to extract speaker embeddings robust to speaking style variations in text-independent speaker verification.
An entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer.
arXiv Detail & Related papers (2022-06-28T01:14:09Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech [37.6839508524855]
Adapting a speech emotion recognition system to a particular speaker is a hard problem, especially with deep neural networks (DNNs).
This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set.
We propose three alternative adaptation strategies: unique speaker, oversampling and weighting approaches.
arXiv Detail & Related papers (2022-01-19T22:14:49Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on the CIFAR-10LT, CIFAR-100LT and Webvision datasets, observing that Prototypical obtains substantial improvements compared with the state of the art.
arXiv Detail & Related papers (2021-10-22T01:55:01Z)
- End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [45.38809571153867]
We propose the End-to-End Neural Diarization (EEND) in which a neural network directly outputs speaker diarization results.
By feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations.
arXiv Detail & Related papers (2020-02-24T14:53:32Z)
- DropClass and DropAdapt: Dropping classes for deep speaker representation learning [33.60058873783114]
This work proposes two approaches to learning embeddings, based on the notion of dropping classes during training.
We demonstrate that both approaches can yield performance gains in speaker verification tasks.
arXiv Detail & Related papers (2020-02-02T18:43:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.