Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency
- URL: http://arxiv.org/abs/2407.04293v1
- Date: Fri, 5 Jul 2024 06:54:27 GMT
- Title: Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency
- Authors: Roman Aperdannier, Sigurd Schacht, Alexander Piazza
- Abstract summary: The latency is the time span from audio input to the output of the corresponding speaker label.
The lowest latency is achieved for the DIART-pipeline with the embedding model pyannote/embedding.
The FS-EEND system shows a similarly good latency.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, different online speaker diarization systems are evaluated on the same hardware with the same test data with regard to their latency. The latency is the time span from audio input to the output of the corresponding speaker label. As part of the evaluation, various model combinations within the DIART framework, a diarization system based on the online clustering algorithm UIS-RNN-SML, and the end-to-end online diarization system FS-EEND are compared. The lowest latency is achieved for the DIART-pipeline with the embedding model pyannote/embedding and the segmentation model pyannote/segmentation. The FS-EEND system shows a similarly good latency. In general there is currently no published research that compares several online diarization systems in terms of their latency. This makes this work even more relevant.
Related papers
- An approach to optimize inference of the DIART speaker diarization pipeline [44.99833362998488]
Speaker diarization with low latency is referred to as online speaker diarization.
The DIART pipeline is an online speaker diarization system.
The aim of this paper is to optimize the inference latency of the DIART pipeline.
arXiv Detail & Related papers (2024-08-05T09:38:07Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- BeatNet: CRNN and Particle Filtering for Online Joint Beat Downbeat and Meter Tracking [21.352141245632247]
We introduce an online system for joint beat, downbeat, and meter tracking, which utilizes causal convolutional and recurrent layers.
The proposed system does not need to be primed with a time signature in order to perform downbeat tracking, and is instead able to estimate meter and adjust the predictions over time.
Experiments on the GTZAN dataset, which is unseen during training, show that the system outperforms various online beat and downbeat tracking systems.
arXiv Detail & Related papers (2021-08-08T06:07:59Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- A low latency ASR-free end to end spoken language understanding system [11.413018142161249]
This work proposes a system that has a small enough footprint to run on small micro-controllers and embedded systems with minimal latency.
Given a streaming input speech signal, the proposed system can process it segment-by-segment without the need to have the entire stream at the moment of processing.
Experiments show that the proposed system yields state-of-the-art performance with the advantage of low latency and a much smaller model when compared to other published works on the same task.
arXiv Detail & Related papers (2020-11-10T04:16:56Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Scaling Up Online Speech Recognition Using ConvNets [33.75588539732141]
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
arXiv Detail & Related papers (2020-01-27T12:55:02Z)