Related papers: Streaming Sequence Transduction through Dynamic Compression

URL: http://arxiv.org/abs/2402.01172v1
Date: Fri, 2 Feb 2024 06:31:50 GMT
Title: Streaming Sequence Transduction through Dynamic Compression
Authors: Weiting Tan, Yunmo Chen, Tongfei Chen, Guanghui Qin, Haoran Xu, Heidi C. Zhang, Benjamin Van Durme, Philipp Koehn
Abstract summary: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR). STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
Score: 55.0083843520833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR) and outperforming existing methods. Moreover, STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
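To make the "12x" figure in the abstract concrete, here is a toy sketch of what compressing a stream into per-segment representations means. It is not STAR's method: the segment boundaries are hard-coded (in the paper they are chosen dynamically by the model), mean pooling is only a stand-in for however anchor representations are actually computed, and the names `compress_stream` and `compression_ratio` are hypothetical.

```python
from typing import List, Sequence


def compress_stream(frames: Sequence[Sequence[float]],
                    boundaries: Sequence[int]) -> List[List[float]]:
    """Collapse each segment of frames into one vector.

    `boundaries` holds the exclusive end index of every segment, in order.
    Mean pooling is an assumption made for illustration only.
    """
    anchors: List[List[float]] = []
    start = 0
    for end in boundaries:
        segment = frames[start:end]
        if not segment:
            raise ValueError("each segment must contain at least one frame")
        dim = len(segment[0])
        # One pooled vector stands in for the whole segment.
        anchors.append([sum(f[d] for f in segment) / len(segment)
                        for d in range(dim)])
        start = end
    return anchors


def compression_ratio(n_frames: int, n_anchors: int) -> float:
    """Input frames per retained representation."""
    return n_frames / n_anchors


# 24 two-dimensional frames split into 2 uneven segments (made-up boundaries).
frames = [[float(i), float(2 * i)] for i in range(24)]
boundaries = [10, 24]
anchors = compress_stream(frames, boundaries)
ratio = compression_ratio(len(frames), len(anchors))
print(anchors, ratio)
```

With 24 input frames reduced to 2 vectors, the ratio is 12, which is the sense in which a downstream decoder would attend to 12 times fewer positions.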

Related papers

Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement [52.89324095217975]
We propose the first streaming accent conversion model that transforms non-native speech into a native-like accent. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism.
arXiv Detail & Related papers (2025-06-19T20:05:29Z)
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a look mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame-by-frame. Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
Streaming Piano Transcription Based on Consistent Onset and Offset Decoding with Sustain Pedal Detection [10.607017917148996]
This paper describes a streaming audio-to-MIDI piano transcription approach that aims to sequentially translate a music signal into a sequence of note onset and offset events. Experiments using the MAESTRO dataset showed that the proposed streaming method performed comparably with or even better than the state-of-the-art offline methods.
arXiv Detail & Related papers (2025-03-03T09:55:54Z)
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
Efficient Encoders for Streaming Sequence Tagging [13.692806815196077]
A naive application of state-of-the-art bidirectional encoders for streaming sequence tagging would require encoding each token from scratch for each new token in an incremental streaming input (like transcribed speech). The lack of re-usability of previous computation leads to a higher number of Floating Point Operations (or FLOPs) and a higher number of unnecessary label flips. We present a Hybrid Encoder with Adaptive Restart (HEAR) that addresses these issues while maintaining the performance of bidirectional encoders over the offline (or complete) inputs.
arXiv Detail & Related papers (2023-01-23T02:20:39Z)
Streaming Align-Refine for Non-autoregressive Deliberation [42.748839817396046]
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model. Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context. Experiments on voice search datasets and LibriSpeech show that with reasonable right context, our streaming model performs as well as the offline counterpart.
arXiv Detail & Related papers (2022-04-15T17:24:39Z)
Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding [21.978994865937786]
The method performs a few refinement steps, where each step shares a transformer decoder that attends to both text features and audio features. We show that, conditioned on hypothesis alignments of a streaming RNN-T model, our method obtains significantly more accurate recognition results than the first-pass RNN-T.
arXiv Detail & Related papers (2021-12-01T01:34:28Z)
Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing. We propose a novel end-to-end streaming NAR speech recognition system. We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR. We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
Streaming Simultaneous Speech Translation with Augmented Memory Transformer [29.248366441276662]
Transformer-based models have achieved state-of-the-art performance on speech translation tasks. We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder.
arXiv Detail & Related papers (2020-10-30T18:28:42Z)
Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder. In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR. We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.