Streaming Transformer-based Acoustic Models Using Self-attention with
Augmented Memory
- URL: http://arxiv.org/abs/2005.08042v1
- Date: Sat, 16 May 2020 16:54:52 GMT
- Title: Streaming Transformer-based Acoustic Models Using Self-attention with
Augmented Memory
- Authors: Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang
- Abstract summary: Transformer-based acoustic modeling has achieved great success
for both hybrid and sequence-to-sequence speech recognition.
We propose a novel augmented memory self-attention, which attends on a short
segment of the input sequence and a bank of memories.
- Score: 23.022723184325017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based acoustic modeling has achieved great success for both
hybrid and sequence-to-sequence speech recognition. However, it requires access to
the full sequence, and the computational cost grows quadratically with respect to
the input sequence length. These factors limit its adoption for streaming
applications. In this work, we propose a novel augmented memory self-attention,
which attends on a short segment of the input sequence and a bank of memories. The
memory bank stores the embedding information for all the processed segments. On the
LibriSpeech benchmark, our proposed method outperforms all existing streamable
transformer methods by a large margin and achieves over 15% relative error
reduction compared with the widely used LC-BLSTM baseline. Our findings are also
confirmed on several large internal datasets.
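As a rough illustration of the mechanism the abstract describes, the sketch below implements a single-head, NumPy-only version of attention over the current segment plus a memory bank of per-segment summaries. The mean-pooled summary, the projection shapes, and the omission of the paper's left/right context frames are simplifying assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def augmented_memory_attention(segments, w_q, w_k, w_v):
    """Single-head sketch: each segment attends over itself plus a bank of
    memory vectors summarizing previously processed segments, so the cost per
    step is bounded by the segment length rather than the full sequence."""
    d = w_q.shape[1]
    memory_bank = []                                   # summaries of processed segments
    outputs = []
    for seg in segments:                               # seg: (seg_len, d_model)
        mem = np.stack(memory_bank) if memory_bank else np.zeros((0, seg.shape[1]))
        context = np.concatenate([mem, seg], axis=0)   # memory bank + current segment
        q = seg @ w_q                                  # queries come from the segment only
        k, v = context @ w_k, context @ w_v
        attn = softmax(q @ k.T / np.sqrt(d))
        outputs.append(attn @ v)
        memory_bank.append(seg.mean(axis=0))           # store a summary of this segment
    return np.concatenate(outputs, axis=0)

# Hypothetical usage: 10 streamed segments of 16 frames each.
d_model, d = 80, 64
rng = np.random.default_rng(0)
proj = lambda: rng.standard_normal((d_model, d)) / np.sqrt(d_model)
print(augmented_memory_attention(
    [rng.standard_normal((16, d_model)) for _ in range(10)],
    proj(), proj(), proj()).shape)                     # (160, 64)
```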
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigating the KV cache's memory overhead include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
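One plausible reading of the low-rank idea, sketched below with NumPy: factorize a KV projection matrix with a truncated SVD so cached token states can be stored as small rank-r codes and expanded only when attention needs them. The shapes, the SVD-based factorization, and the caching of codes are illustrative assumptions rather than LoRC's exact procedure.

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Truncated SVD: W ~= A @ B, with A: (d_model, rank), B: (rank, d_head)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

# Hypothetical shapes; the rank is the compression knob.
d_model, d_head, rank = 1024, 128, 32
rng = np.random.default_rng(0)
w_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
a_k, b_k = low_rank_factorize(w_k, rank)

h = rng.standard_normal((16, d_model))      # 16 cached token hidden states
cache_k = h @ a_k                           # store only rank-32 codes in the cache
keys = cache_k @ b_k                        # expand to full keys when attention runs
print(np.abs(keys - h @ w_k).max())         # error introduced by the truncation
```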
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
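The partition-and-iterate idea can be shown on a feed-forward block, whose large intermediate activation usually dominates memory. The NumPy sketch below is illustrative rather than MsT's implementation: because the MLP acts on each token independently, processing the sequence in mini-sequences yields the same output while only ever materializing one chunk's hidden activation.

```python
import numpy as np

def mlp_block(x, w1, w2):
    """Feed-forward block; its (seq_len, 4*d) hidden activation dominates memory."""
    return np.maximum(x @ w1, 0.0) @ w2

def mlp_block_mini_seq(x, w1, w2, num_chunks=8):
    """Same computation over mini-sequences: only one chunk's hidden
    activation exists at a time, so peak intermediate memory shrinks."""
    return np.concatenate(
        [mlp_block(chunk, w1, w2) for chunk in np.array_split(x, num_chunks, axis=0)],
        axis=0)

rng = np.random.default_rng(0)
d = 64
w1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
w2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
x = rng.standard_normal((4096, d))          # a long sequence of token states
assert np.allclose(mlp_block(x, w1, w2), mlp_block_mini_seq(x, w1, w2))
```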
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
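A minimal PyTorch sketch of truncated backpropagation through time in this setting: a toy recurrent "memory" module (a GRU standing in for the paper's memory-enhanced transformer) is trained segment by segment, with the carried state detached at each boundary so gradients never span the full context. All module and function names here are hypothetical.

```python
import torch
import torch.nn as nn

class ToyMemoryEncoder(nn.Module):
    """Hypothetical stand-in for a memory-enhanced transformer segment encoder."""
    def __init__(self, d):
        super().__init__()
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, d)

    def forward(self, segment, state):
        out, state = self.rnn(segment, state)
        return self.head(out), state

def train_with_tbptt(model, segments, targets, optimizer, loss_fn):
    """Truncated BPTT: backpropagate within each segment, then detach the
    carried state so gradients never span the full long context."""
    state = None
    for segment, target in zip(segments, targets):
        output, state = model(segment, state)
        loss = loss_fn(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = state.detach()              # truncation point between segments
    return state

d = 32
model = ToyMemoryEncoder(d)
segments = torch.randn(8, 1, 64, d).unbind(0)    # 8 segments of 64 tokens each
targets = torch.randn(8, 1, 64, d).unbind(0)
train_with_tbptt(model, segments, targets,
                 torch.optim.Adam(model.parameters(), lr=1e-3), nn.MSELoss())
```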
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends about 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
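A heavily simplified sketch of what progressive token length scaling could look like: early encoder layers operate only on the coarsest feature scale, and finer scales are appended as depth grows, so most layers see far fewer tokens than the full multi-scale set. The growth schedule and the identity "layer" below are assumptions made for illustration, not PRO-SCALE's actual design.

```python
import numpy as np

def encoder_layer(tokens):
    """Placeholder for one self-attention encoder layer (identity here)."""
    return tokens

def progressive_encoder(scale_tokens, num_layers=6):
    """Start from the coarsest backbone scale and append finer scales as the
    encoder gets deeper, so most layers process far fewer tokens than the
    full multi-scale token set."""
    active, next_scale = scale_tokens[0], 1
    for layer in range(num_layers):
        active = encoder_layer(active)
        if layer % 2 == 1 and next_scale < len(scale_tokens):  # assumed growth schedule
            active = np.concatenate([active, scale_tokens[next_scale]], axis=0)
            next_scale += 1
    return active

# Multi-scale backbone tokens, coarse to fine (token counts are illustrative).
scales = [np.random.randn(n, 256) for n in (100, 400, 1600)]
print(progressive_encoder(scales).shape)          # (2100, 256)
```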
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
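The blockwise computation underpinning Ring Attention can be illustrated on a single process: keys and values are visited one block at a time with a running (online) softmax, so the full attention matrix is never materialized. In the actual method each block would live on a different device and be rotated around a ring; the single-device NumPy sketch below only shows the blockwise math.

```python
import numpy as np

def blockwise_attention(q, k, v, block=128):
    """Visit key/value blocks one at a time with a running (online) softmax,
    so the full (len_q x len_kv) score matrix is never materialized."""
    d = q.shape[-1]
    out = np.zeros((q.shape[0], v.shape[1]))
    row_max = np.full(q.shape[0], -np.inf)
    row_sum = np.zeros(q.shape[0])
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T / np.sqrt(d)
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)            # rescale previous accumulators
        p = np.exp(scores - new_max[:, None])
        out = out * scale[:, None] + p @ vb
        row_sum = row_sum * scale + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Check against ordinary full-matrix attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, 32)) for n in (64, 512, 512))
scores = q @ k.T / np.sqrt(32)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
assert np.allclose(blockwise_attention(q, k, v), (p / p.sum(axis=1, keepdims=True)) @ v)
```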
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which treats the transformer decoders as environments.
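A hedged sketch of the chunk/align/select pipeline: split the long input into chunks, encode them as a batch, share a summary across chunks as a crude stand-in for inter-chunk alignment, and keep only the highest-scoring chunks for the decoder. The mean-summary alignment and the norm-based selection score are placeholders; the paper learns the selection policy with the decoder as the environment.

```python
import numpy as np

def chunk_align_select(tokens, encode, chunk_len=512, num_select=4):
    """Split a long input into chunks, encode them as a batch, share a mean
    summary across chunks (stand-in for inter-chunk alignment), and keep only
    the highest-scoring chunks for the decoder."""
    chunks = [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]
    encoded = [encode(c) for c in chunks]
    summary = np.mean([e.mean(axis=0) for e in encoded], axis=0)
    aligned = [e + summary for e in encoded]                    # crude alignment step
    scores = [np.linalg.norm(e.mean(axis=0)) for e in aligned]  # placeholder policy
    keep = sorted(np.argsort(scores)[-num_select:])
    return np.concatenate([aligned[i] for i in keep], axis=0)

identity_encoder = lambda chunk: chunk                          # stand-in encoder
out = chunk_align_select(np.random.default_rng(0).standard_normal((8192, 64)),
                         identity_encoder)
print(out.shape)                                                # (2048, 64)
```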
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
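A simplified view of combining segmented (local) attention with a recurrent summary: each segment attends over its own tokens plus one recurrent state vector that is updated after every segment. The exponential-moving-average update in the NumPy sketch below is a placeholder for the learned recurrent attention the paper actually uses.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def segmented_recurrent_attention(x, w_q, w_k, w_v, seg_len=64, alpha=0.9):
    """Each segment attends over its own tokens plus one recurrent state
    vector summarizing the past; the state is updated after every segment."""
    d = w_q.shape[1]
    state = np.zeros(x.shape[1])
    outputs = []
    for s in range(0, x.shape[0], seg_len):
        seg = x[s:s + seg_len]
        context = np.vstack([state[None, :], seg])    # recurrent state + local tokens
        attn = softmax((seg @ w_q) @ (context @ w_k).T / np.sqrt(d))
        outputs.append(attn @ (context @ w_v))
        state = alpha * state + (1 - alpha) * seg.mean(axis=0)
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
d_model, d = 64, 32
proj = lambda: rng.standard_normal((d_model, d)) / np.sqrt(d_model)
print(segmented_recurrent_attention(
    rng.standard_normal((256, d_model)), proj(), proj(), proj()).shape)   # (256, 32)
```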
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- Self-Gated Memory Recurrent Network for Efficient Scalable HDR Deghosting [59.04604001936661]
We propose a novel recurrent network-based HDR deghosting method for fusing arbitrary length dynamic sequences.
We introduce a new recurrent cell architecture, namely Self-Gated Memory (SGM) cell, that outperforms the standard LSTM cell.
The proposed approach achieves state-of-the-art performance compared to existing HDR deghosting methods quantitatively across three publicly available datasets.
arXiv Detail & Related papers (2021-12-24T12:36:33Z)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- Conformer-Kernel with Query Term Independence for Document Retrieval [32.36908635150144]
The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark.
We extend the TK architecture to the full retrieval setting by incorporating the query term independence assumption.
We show that the Conformer's GPU memory requirement scales linearly with input sequence length, making it a more viable option when ranking long documents.
arXiv Detail & Related papers (2020-07-20T19:47:28Z)