Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming
Disfluency Detection
- URL: http://arxiv.org/abs/2205.00620v1
- Date: Mon, 2 May 2022 02:13:24 GMT
- Title: Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming
Disfluency Detection
- Authors: Angelica Chen, Vicky Zayats, Daniel D. Walker, Dirk Padfield
- Abstract summary: A streaming BERT-based sequence tagging model detects disfluencies in real time.
The model attains state-of-the-art latency and stability scores compared with recent work on incremental disfluency detection.
- Score: 3.884530687475798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In modern interactive speech-based systems, speech is consumed and
transcribed incrementally prior to having disfluencies removed. This
post-processing step is crucial for producing clean transcripts and high
performance on downstream tasks (e.g. machine translation). However, most
current state-of-the-art NLP models such as the Transformer operate
non-incrementally, potentially causing unacceptable delays. We propose a
streaming BERT-based sequence tagging model that, combined with a novel
training objective, is capable of detecting disfluencies in real-time while
balancing accuracy and latency. This is accomplished by training the model to
decide whether to immediately output a prediction for the current input or to
wait for further context. Essentially, the model learns to dynamically size its
lookahead window. Our results demonstrate that our model produces comparably
accurate predictions and does so sooner than our baselines, with lower flicker.
Furthermore, the model attains state-of-the-art latency and stability scores
when compared with recent work on incremental disfluency detection.
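To make the wait-or-emit mechanism described above concrete, the sketch below shows a BERT tagger with an extra per-token "emit" head: at each new word the growing prefix is re-encoded, and a label is released only for positions whose emit probability clears a threshold, so the lookahead window is sized dynamically. All names here (StreamingDisfluencyTagger, wait_head, the 0.5 threshold) and the prefix re-encoding loop are illustrative assumptions, not the authors' released implementation, and the paper's training objective is omitted.

```python
# Minimal illustrative sketch, not the authors' implementation: a BERT encoder
# with a tag head (disfluency label) and a wait head (emit now vs. wait).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class StreamingDisfluencyTagger(nn.Module):
    """Hypothetical streaming tagger: one head predicts each token's label,
    the other predicts whether that label can already be emitted."""

    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.tag_head = nn.Linear(hidden, num_labels)  # e.g. fluent vs. disfluent
        self.wait_head = nn.Linear(hidden, 1)          # P(emit now) per token

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.tag_head(states), torch.sigmoid(self.wait_head(states))


@torch.no_grad()
def stream(tagger, tokenizer, words, emit_threshold: float = 0.5):
    """Re-encode the growing prefix after each incoming word and release a label
    only once the model's emit probability clears the (assumed) threshold."""
    tagger.eval()
    emitted, outputs = 0, []
    for t in range(1, len(words) + 1):
        enc = tokenizer(words[:t], is_split_into_words=True, return_tensors="pt")
        tag_logits, emit_prob = tagger(enc["input_ids"], enc["attention_mask"])
        # map each word in the prefix to its last wordpiece position
        last_piece = {w: i for i, w in enumerate(enc.word_ids(0)) if w is not None}
        # finalize the longest run of pending words the model is confident about
        while emitted < t and emit_prob[0, last_piece[emitted], 0] >= emit_threshold:
            label = tag_logits[0, last_piece[emitted]].argmax().item()
            outputs.append((words[emitted], label))
            emitted += 1
    # at the end of the utterance, flush anything still pending with full context
    if emitted < len(words):
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        tag_logits, _ = tagger(enc["input_ids"], enc["attention_mask"])
        last_piece = {w: i for i, w in enumerate(enc.word_ids(0)) if w is not None}
        for w in range(emitted, len(words)):
            outputs.append((words[w], tag_logits[0, last_piece[w]].argmax().item()))
    return outputs


if __name__ == "__main__":
    tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = StreamingDisfluencyTagger()
    print(stream(model, tok, "i want a flight to boston uh to denver".split()))
```

In the paper's framing, the wait head would be trained to trade latency (how long a label is withheld) against stability (how often an already-emitted label would have to change, i.e. flicker); the untrained sketch above only shows the inference-time control flow.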
Related papers
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
Running on a quantized 8-bit ViT, FOA outperforms gradient-based TENT on a full-precision 32-bit ViT and achieves up to a 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- The Missing U for Efficient Diffusion Models [3.712196074875643]
Diffusion Probabilistic Models yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design.
Despite their capabilities, their efficiency, especially in the reverse process, remains a challenge due to slow convergence rates and high computational costs.
We introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models.
arXiv Detail & Related papers (2023-10-31T00:12:14Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Post-Processing Temporal Action Detection [134.26292288193298]
Temporal Action Detection (TAD) methods typically apply a pre-processing step that converts a varying-length input video into a fixed-length sequence of snippet representations.
This pre-processing step temporally downsamples the video, reducing the inference resolution and hampering detection performance at the original temporal resolution.
We introduce a novel model-agnostic post-processing method that requires neither model redesign nor retraining.
arXiv Detail & Related papers (2022-11-27T19:50:37Z)
- StreamYOLO: Real-time Object Detection for Streaming Perception [84.2559631820007]
We endow the models with the capacity to predict the future, significantly improving the results for streaming perception.
We consider driving scenes with multiple velocities and propose a velocity-aware streaming AP (VsAP) to jointly evaluate accuracy across them.
Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves the sAP and VsAP by 4.7% and 8.2% respectively.
arXiv Detail & Related papers (2022-07-21T12:03:02Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on the Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training compared with a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Disfluency Detection with Unlabeled Data and Small BERT Models [3.04133054437883]
We investigate the disfluency detection task, focusing on small, fast, on-device models based on the BERT architecture.
We demonstrate that it is possible to train disfluency detection models as small as 1.3 MiB while retaining high performance.
arXiv Detail & Related papers (2021-04-21T21:24:32Z)
- Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling [35.43382144290393]
We introduce efficient approximations to make inference with the noisy channel approach as fast as strong ensembles.
We also show that the noisy channel approach can outperform strong pre-training results by achieving a new state of the art on WMT Romanian-English translation.
arXiv Detail & Related papers (2020-11-13T23:22:28Z)
- Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection [10.265607222257263]
We propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly completes the punctuation prediction and disfluency detection tasks in real time.
The proposed approach outperforms previous state-of-the-art models on F-scores while achieving a competitive inference speed; a minimal sketch of the fixed time-delay idea appears after this list.
arXiv Detail & Related papers (2020-03-03T03:17:29Z)
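The Controllable Time-Delay Transformer entry above fixes how much future context each position may attend to, in contrast to the dynamically sized lookahead of the main paper. Below is a minimal, hypothetical sketch of such a fixed-lookahead attention mask applied to a stock attention layer; the function name and the choice of lookahead are assumptions for illustration, not the CT-Transformer's published implementation.

```python
# Illustrative fixed-lookahead self-attention mask: position i may attend to all
# past positions and to at most `lookahead` future positions. This is an assumed
# sketch of the general time-delay idea, not the CT-Transformer itself.
import torch


def fixed_lookahead_mask(seq_len: int, lookahead: int) -> torch.Tensor:
    """Boolean attention mask where True marks positions that must NOT be attended to."""
    idx = torch.arange(seq_len)
    # disallow attention to positions more than `lookahead` steps in the future
    return (idx[None, :] - idx[:, None]) > lookahead


if __name__ == "__main__":
    seq_len, lookahead, dim = 6, 2, 16
    mask = fixed_lookahead_mask(seq_len, lookahead)
    attn = torch.nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)
    x = torch.randn(1, seq_len, dim)
    out, _ = attn(x, x, x, attn_mask=mask)  # each step sees at most 2 future tokens
    print(mask.int())   # 0 = visible, 1 = masked out
    print(out.shape)    # torch.Size([1, 6, 16])
```

A fixed mask like this gives a constant, predictable delay; the main paper instead lets the model decide per token how much right context it needs before committing to a prediction.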