Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning
- URL: http://arxiv.org/abs/2404.10842v1
- Date: Tue, 16 Apr 2024 18:40:28 GMT
- Title: Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning
- Authors: Amit Kumar Bhuyan, Hrishikesh Dutta, Subir Biswas
- Abstract summary: This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
The proposed Federated Learning model identifies the participants in a conversation without requiring a large audio database for training.
An unsupervised online update mechanism, based on the cosine similarity of speaker embeddings, is proposed for the Federated Learning model.
- Score: 2.3076690318595676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices. The work proposes a Federated Learning model which can identify the participants in a conversation without requiring a large audio database for training. An unsupervised online update mechanism, based on the cosine similarity of speaker embeddings, is proposed for the Federated Learning model. Moreover, the proposed diarization system solves the problem of speaker change detection via unsupervised segmentation techniques using Hotelling's t-squared statistic and the Bayesian Information Criterion (BIC). In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates. Additionally, the computational overhead of frame-by-frame speaker identification is reduced via unsupervised clustering of speech segments. The results demonstrate the effectiveness of the proposed training method in the presence of non-IID speech data, and show a considerable reduction in false and missed detections at the segmentation stage alongside lower computational overhead. The improved accuracy and reduced computational cost make the mechanism suitable for real-time speaker diarization across a distributed IoT audio network.
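To make the segmentation and identification steps above concrete, here is a minimal NumPy sketch of two of the named building blocks: the textbook delta-BIC speaker-change test and cosine-similarity-based online speaker assignment. The similarity threshold, running-mean centroid update, and covariance regularization are illustrative assumptions; the paper's federated aggregation, Hotelling's t-squared screening, and quasi-silence biasing are not reproduced here.

```python
import numpy as np

def delta_bic(X, t, penalty=1.0):
    """Delta-BIC test for a candidate speaker-change point.

    X is an (N, d) matrix of per-frame acoustic features and t is the
    candidate split index (2 <= t <= N - 2 so both sides have enough
    frames). A positive value means two separate Gaussian models fit
    the window better than one, i.e. a likely speaker change. This is
    the standard full-covariance formulation, not the paper's exact
    pipeline.
    """
    N, d = X.shape

    def logdet_cov(Y):
        # Small diagonal load (an assumption) keeps the covariance
        # well-conditioned on short segments.
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    complexity = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * t * logdet_cov(X[:t])
            - 0.5 * (N - t) * logdet_cov(X[t:])
            - penalty * complexity)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class OnlineSpeakerRegistry:
    """Unsupervised online speaker assignment from segment embeddings.

    Each segment embedding is matched to the closest running speaker
    centroid by cosine similarity; if the best match falls below the
    (assumed) threshold, a new speaker profile is enrolled. The
    running-mean centroid update stands in for the paper's federated
    model update.
    """

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cutoff
        self.centroids = []         # one centroid vector per speaker
        self.counts = []            # segments seen per speaker

    def assign(self, embedding):
        """Return a speaker id for this embedding, updating state online."""
        embedding = np.asarray(embedding, dtype=float)
        if self.centroids:
            sims = [cosine_similarity(embedding, c) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = self.counts[best]
                self.centroids[best] = (self.centroids[best] * n + embedding) / (n + 1)
                self.counts[best] = n + 1
                return best
        self.centroids.append(embedding)
        self.counts.append(1)
        return len(self.centroids) - 1
```

In a full pipeline of this shape, delta_bic would be scanned over candidate split points near detected quasi-silences, and each resulting segment's embedding would be routed through the registry, so speakers are identified per segment rather than per frame.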
Related papers
- Robust Channel Learning for Large-Scale Radio Speaker Verification [30.332141166518287]
We present a Channel Robust Speaker Learning (CRSL) framework that enhances the robustness of the current speaker verification pipeline.
Our framework introduces an augmentation module that mitigates bandwidth variations in radio speech datasets.
We also propose an efficient fine-tuning method that reduces the need for extensive additional training time and large amounts of data.
arXiv Detail & Related papers (2024-06-16T14:17:57Z)
- Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization [3.836171323110284]
We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach.
arXiv Detail & Related papers (2023-12-21T16:53:04Z)
- What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked with identifying a speaker from a set of enrolled speakers given only a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Personalized Keyphrase Detection using Speaker and Environment Information [24.766475943042202]
We introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary.
The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model.
arXiv Detail & Related papers (2021-04-28T18:50:19Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Sparse Mixture of Local Experts for Efficient Speech Enhancement [19.645016575334786]
We investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
By splitting up the speech denoising task into non-overlapping subproblems, we are able to improve denoising performance while also reducing computational complexity.
Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network.
arXiv Detail & Related papers (2020-05-16T23:23:22Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.