Real-time Speech Interruption Analysis: From Cloud to Client Deployment
- URL: http://arxiv.org/abs/2210.13334v1
- Date: Mon, 24 Oct 2022 15:39:51 GMT
- Title: Real-time Speech Interruption Analysis: From Cloud to Client Deployment
- Authors: Quchen Fu, Szu-Wei Fu, Yaran Fan, Yu Wu, Zhuo Chen, Jayant Gupchup,
Ross Cutler
- Abstract summary: We have recently developed the first speech interruption analysis model, which detects failed speech interruptions.
To deliver this feature in a more cost-efficient and environment-friendly way, we reduced the model complexity and size to ship the WavLM_SI model in client devices.
- Score: 20.694024217864783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Meetings are an essential form of communication for all types of
organizations, and remote collaboration systems have been much more widely used
since the COVID-19 pandemic. One major issue with remote meetings is that it is
challenging for remote participants to interrupt and speak. We have recently
developed the first speech interruption analysis model, which detects failed
speech interruptions, shows very promising performance, and is being deployed
in the cloud. To deliver this feature in a more cost-efficient and
environment-friendly way, we reduced the model complexity and size to ship the
WavLM_SI model in client devices. In this paper, we first describe how we
successfully improved the True Positive Rate (TPR) at a 1% False Positive Rate
(FPR) from 50.9% to 68.3% for the failed speech interruption detection model by
training on a larger dataset and fine-tuning. We then shrank the model size
from 222.7 MB to 9.3 MB with an acceptable loss in accuracy and reduced the
complexity from 31.2 GMACS (Giga Multiply-Accumulate Operations per Second) to
4.3 GMACS. We also estimated the environmental impact of the complexity
reduction, which can be used as a general guideline for large Transformer-based
models, and thus make those models more accessible with less computation
overhead.
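As a minimal illustration of the headline metric, the sketch below shows one common way to compute the True Positive Rate at a fixed 1% False Positive Rate from detector scores; the function name, the synthetic data, and the use of scikit-learn's roc_curve are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (illustrative, not from the paper): computing the true
# positive rate at a fixed false positive rate from detector scores.
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """Return the TPR at the largest ROC operating point with FPR <= target_fpr."""
    fpr, tpr, _ = roc_curve(labels, scores)  # fpr is non-decreasing
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    return tpr[max(idx, 0)]

# Hypothetical usage: labels mark failed speech interruptions (1) vs. not (0),
# scores are the model's confidence for the positive class.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels + rng.normal(0.0, 0.8, size=1000)
print(f"TPR @ 1% FPR: {tpr_at_fpr(labels, scores):.3f}")
```

In practice, the score threshold that yields the target FPR on a validation set would then be fixed as the operating point for deployment.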
Related papers
- Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference [16.328220661765744]
We introduce a novel plug-in method Comet to reduce the communication cost without compromising the inference performance.
We evaluate our Comet on Bert and RoBERTa models with GLUE benchmark datasets, showing up to 3.9× less communication and 3.5× speedups.
arXiv Detail & Related papers (2024-05-24T18:43:00Z)
- ResFed: Communication Efficient Federated Learning by Transmitting Deep Compressed Residuals [24.13593410107805]
Federated learning enables cooperative training among massively distributed clients by sharing their learned local model parameters.
We introduce a residual-based federated learning framework (ResFed), where residuals rather than model parameters are transmitted in communication networks for training.
By employing a common prediction rule, both locally and globally updated models are always fully recoverable in clients and the server.
arXiv Detail & Related papers (2022-12-11T20:34:52Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this also leads to a huge computation cost.
We explore accelerating large-model inference by conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Error Detection in Large-Scale Natural Language Understanding Systems Using Transformer Models [0.0]
Large-scale conversational assistants like Alexa, Siri, Cortana and Google Assistant process every utterance using multiple models for domain, intent and named entity recognition.
We address the challenge of detecting domain classification errors using offline Transformer models.
We combine utterance encodings from a RoBERTa model with the N-best hypotheses produced by the production system. We then fine-tune end-to-end in a multitask setting using a small dataset of human-annotated utterances with domain classification errors.
arXiv Detail & Related papers (2021-09-04T00:10:48Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- Small footprint Text-Independent Speaker Verification for Embedded Systems [7.123796359179192]
We present a two-stage model architecture orders of magnitude smaller than common solutions for speaker verification.
We demonstrate the possibility of running our solution on small devices typical of IoT systems, such as the Raspberry Pi 3B, with a latency below 200 ms on a 5 s utterance.
arXiv Detail & Related papers (2020-11-03T13:53:05Z)
- Simplified Self-Attention for Transformer-based End-to-End Speech Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
arXiv Detail & Related papers (2020-05-21T04:55:59Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
- Attention based on-device streaming speech recognition with large speech corpus [16.702653972113023]
We present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (>10K-hour) corpus.
We attained a word recognition rate of around 90% for the general domain, mainly by using joint training of connectionist temporal classification (CTC) and cross-entropy (CE) losses.
For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative improvement of 36% on average in word error rate (WER) for target domains, including the general domain.
arXiv Detail & Related papers (2020-01-02T04:24:44Z)