An approach to optimize inference of the DIART speaker diarization pipeline
- URL: http://arxiv.org/abs/2408.02341v1
- Date: Mon, 5 Aug 2024 09:38:07 GMT
- Title: An approach to optimize inference of the DIART speaker diarization pipeline
- Authors: Roman Aperdannier, Sigurd Schacht, Alexander Piazza
- Abstract summary: Speaker diarization with low latency is referred to as online speaker diarization.
The DIART pipeline is an online speaker diarization system.
The aim of this paper is to optimize the inference latency of the DIART pipeline.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speaker diarization answers the question "who spoke when" for an audio file. In some diarization scenarios, low latency is required for transcription. Speaker diarization with low latency is referred to as online speaker diarization. The DIART pipeline is an online speaker diarization system. It consists of a segmentation and an embedding model. The embedding model has the largest share of the overall latency. The aim of this paper is to optimize the inference latency of the DIART pipeline. Different inference optimization methods such as knowledge distillation, pruning, quantization and layer fusion are applied to the embedding model of the pipeline. It turns out that knowledge distillation reduces the latency but has a negative effect on accuracy. Quantization and layer fusion also improve the latency without worsening the accuracy. Pruning, on the other hand, does not improve latency.
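The abstract reports that quantization and layer fusion reduce the embedding model's latency without hurting accuracy. The sketch below is a minimal, illustrative example of how those two steps are typically applied with PyTorch's quantization utilities; the ToyEmbedding module and its layer names (conv, bn, relu, lstm, fc) are stand-ins invented for this example and are not the actual pyannote embedding architecture used by DIART.

```python
import time
import torch
import torch.nn as nn

class ToyEmbedding(nn.Module):
    """Stand-in for a speaker embedding model (not the real DIART/pyannote network)."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 192):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, 256, kernel_size=5, padding=2)
        self.bn = nn.BatchNorm1d(256)
        self.relu = nn.ReLU()
        self.lstm = nn.LSTM(256, 256, batch_first=True)
        self.fc = nn.Linear(256, emb_dim)

    def forward(self, x):                    # x: (batch, n_mels, frames)
        x = self.relu(self.bn(self.conv(x)))
        x, _ = self.lstm(x.transpose(1, 2))  # (batch, frames, 256)
        return self.fc(x.mean(dim=1))        # (batch, emb_dim)

model = ToyEmbedding().eval()

# Layer fusion: merge conv + batchnorm + relu into a single module,
# removing intermediate memory traffic at inference time.
fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# Dynamic quantization: Linear/LSTM weights are stored in int8 and
# dequantized on the fly; activations stay in floating point.
quantized = torch.quantization.quantize_dynamic(
    fused, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)

# Rough latency comparison on a dummy chunk of features.
x = torch.randn(1, 80, 500)
for name, m in [("baseline", model), ("fused+quantized", quantized)]:
    with torch.inference_mode():
        start = time.perf_counter()
        for _ in range(50):
            m(x)
        print(f"{name}: {(time.perf_counter() - start) / 50 * 1000:.1f} ms/chunk")
```

Dynamic quantization is used here only because it needs no calibration data; the exact optimization setup evaluated in the paper may differ.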
Related papers
- Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency [44.99833362998488]
The latency is the time span from audio input to the output of the corresponding speaker label.
The lowest latency is achieved by the DIART pipeline with the embedding model pyannote/embedding.
The FS-EEND system shows a similarly good latency.
arXiv Detail & Related papers (2024-07-05T06:54:27Z) - Short-Term Memory Convolutions [0.0]
We propose a novel method for minimizing inference latency and memory consumption, called Short-Term Memory Convolution (STMC).
The training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs).
In the case of speech separation, we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality.
arXiv Detail & Related papers (2023-02-08T20:52:24Z) - Multi-mode Transformer Transducer with Stochastic Future Context [53.005638503544866]
Multi-mode speech recognition models can process longer future context to achieve higher accuracy, and when the latency budget is not flexible, they can still achieve reliable accuracy.
We show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
arXiv Detail & Related papers (2021-06-17T18:42:11Z) - Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z) - Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides how much of the input to read before starting to synthesise audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z) - A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z) - Low Latency ASR for Simultaneous Speech Translation [27.213294097841853]
We have worked on several techniques for reducing the latency of both components: the automatic speech recognition module and the speech translation module.
We combine run-on decoding with a technique for identifying stable partial hypotheses during stream decoding and a protocol for dynamic output updates.
This combination reduces the latency at word level, where the words are final and will never be updated again in the future, from 18.1s to 1.1s without sacrificing performance.
arXiv Detail & Related papers (2020-03-22T13:37:05Z) - Scaling Up Online Speech Recognition Using ConvNets [33.75588539732141]
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
arXiv Detail & Related papers (2020-01-27T12:55:02Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)