Partial Rewriting for Multi-Stage ASR
- URL: http://arxiv.org/abs/2312.09463v1
- Date: Fri, 8 Dec 2023 00:31:43 GMT
- Title: Partial Rewriting for Multi-Stage ASR
- Authors: Antoine Bruguier and David Qiu and Yanzhang He
- Abstract summary: We improve the quality of streaming results by around 10%, without altering the final results.
Our approach introduces no additional latency and reduces flickering.
It is also lightweight, does not require retraining the model, and it can be applied to a wide variety of multi-stage architectures.
- Score: 14.642804773149713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For many streaming automatic speech recognition tasks, it is important to
provide timely intermediate streaming results, while refining a high quality
final result. This can be done using a multi-stage architecture, where a small
left-context only model creates streaming results and a larger left- and
right-context model produces a final result at the end. While this
significantly improves the quality of the final results without compromising
the streaming emission latency of the system, streaming results do not benefit
from the quality improvements. Here, we propose using a text manipulation
algorithm that merges the streaming outputs of both models. We improve the
quality of streaming results by around 10%, without altering the final results.
Our approach introduces no additional latency and reduces flickering. It is
also lightweight, does not require retraining the model, and it can be applied
to a wide variety of multi-stage architectures.
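As a rough illustration of what merging the streaming outputs of two models could look like, here is a minimal sketch. It is not the authors' algorithm: the abstract does not specify how the two outputs are aligned, so this sketch assumes a naive word-count alignment, and the function name `merge_partials` and the example strings are hypothetical.

```python
def merge_partials(fast_words, slow_words):
    """Hypothetical merge of two partial hypotheses (not the paper's method).

    Assumption: the larger model's partial output covers a prefix of the
    utterance and is aligned to the small model's output by word count.
    The covered prefix is taken from the larger model; the remaining
    words come from the small streaming model.
    """
    covered = len(slow_words)
    return list(slow_words) + list(fast_words[covered:])


# The small left-context model is ahead but less accurate.
fast = "i want to here the whether".split()
# The larger model lags behind but has corrected the prefix it has seen.
slow = "i want to hear".split()

merged = merge_partials(fast, slow)
print(" ".join(merged))
```

Under this assumption, the displayed partial result picks up the larger model's corrections for the prefix it has already processed, while the tail still comes from the small model, so nothing is held back while waiting for the larger model.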
Related papers
- RAIN: Real-time Animation of Infinite Video Stream [52.97171098038888]
RAIN is a pipeline solution capable of animating infinite video streams in real-time with low latency.
RAIN generates video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams.
RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors.
arXiv Detail & Related papers (2024-12-27T07:13:15Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Streaming Align-Refine for Non-autoregressive Deliberation [42.748839817396046]
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.
Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context.
Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart.
arXiv Detail & Related papers (2022-04-15T17:24:39Z)
- Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z)
- Multi-Density Attention Network for Loop Filtering in Video Compression [9.322800480045336]
We propose an on-line scaling based multi-density attention network for loop filtering in video compression.
Experimental results show that 10.18% bit-rate reduction at the same video quality can be achieved over the latest Versatile Video Coding (VVC) standard.
arXiv Detail & Related papers (2021-04-08T05:46:38Z)
- Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming model simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z)
- Cascaded encoders for unifying streaming and non-streaming ASR [68.62941009369125]
This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously.
A single decoder then learns to decode either using the output of the streaming or the non-streaming encoder.
Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains a 10% to 27% relative improvement when operating in non-streaming mode.
arXiv Detail & Related papers (2020-10-27T20:59:50Z)
- Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z)
- Deep Space-Time Video Upsampling Networks [47.62807427163614]
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems.
We propose an end-to-end framework for the space-time video upsampling by efficiently merging VSR and FI into a joint framework.
The framework shows better results both quantitatively and qualitatively, while reducing the time (x7 faster) and the number of parameters (30%) compared to baselines.
arXiv Detail & Related papers (2020-04-06T07:04:21Z)