Related papers: Learning Monotonic Attention in Transducer for Streaming Generation

Learning Monotonic Attention in Transducer for Streaming Generation

URL: http://arxiv.org/abs/2411.17170v1
Date: Tue, 26 Nov 2024 07:19:26 GMT
Title: Learning Monotonic Attention in Transducer for Streaming Generation
Authors: Zhengrui Ma, Yang Feng, Min Zhang,
Abstract summary: We propose a learnable monotonic attention mechanism to handle non-monotonic alignments in Transducer-based streaming generation models. Our approach allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space.
Score: 26.24357071901915
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Streaming generation models are increasingly utilized across various fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these contexts. In this research, we address this issue by tightly integrating Transducer's decoding with the history of input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention in training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution for Transducer-based frameworks to tackle more complex streaming generation tasks.

Related papers

Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers [9.617245548268437]
Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers.<n>This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token.<n>We propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism.
arXiv Detail & Related papers (2026-02-16T14:04:42Z)
Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty [37.15356899831919]
Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams.<n>We propose a novel neuro-inspired non-blocking inference paradigm that employs adaptive temporal windows of integration.<n>Our framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff.
arXiv Detail & Related papers (2025-11-20T10:48:54Z)
Self-Speculative Biased Decoding for Faster Live Translation [0.0]
Self-Speculative Biased Decoding is a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream.<n>We show that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality.
arXiv Detail & Related papers (2025-09-26T01:13:37Z)
Sequence Complementor: Complementing Transformers For Time Series Forecasting with Learnable Sequences [5.244482076690776]
We find that expressive capability of sequence representation is a key factor influencing Transformer performance in time forecasting. We propose a novel attention mechanism with Sequence Complementors and prove feasible from an information theory perspective.
arXiv Detail & Related papers (2025-01-06T03:08:39Z)
LinFormer: A Linear-based Lightweight Transformer Architecture For Time-Aware MIMO Channel Prediction [39.12741712294741]
6th generation (6G) mobile networks bring new challenges in supporting high-mobility communications. We present LinFormer, an innovative channel prediction framework based on a scalable, all-linear, encoder-only Transformer model. Our approach achieves a substantial reduction in computational complexity while maintaining high prediction accuracy, making it more suitable for deployment in cost-effective base stations (BS)
arXiv Detail & Related papers (2024-10-28T13:04:23Z)
PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction. We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences. We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures [46.58170057001437]
We introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences. We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts.
arXiv Detail & Related papers (2024-05-31T14:00:44Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
ALERT-Transformer: Bridging Asynchronous and Synchronous Machine Learning for Real-Time Event-based Spatio-Temporal Data [8.660721666999718]
We propose a hybrid pipeline composed of asynchronous sensing and synchronous processing. We achieve performances state-of-the-art with a lower latency than competitors.
arXiv Detail & Related papers (2024-02-02T13:17:19Z)
Learning Stationary Markov Processes with Contrastive Adjustment [2.76240219662896]
We introduce a new optimization algorithm, termed emphcontrastive adjustment, for learning Markov transition kernels. Contrastive adjustment is not restricted to a particular family of transition distributions and can be used to model data in both continuous and discrete state spaces. We show that contrastive adjustment is highly valuable in human-computer design processes.
arXiv Detail & Related papers (2023-03-09T18:50:15Z)
Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning. Transformer models have been adopted to deliver high prediction capacity because of the high computational self-attention mechanism. We propose an efficient Transformerbased model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
Error Correction Code Transformer [92.10654749898927]
We propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths. We encode each channel's output dimension to high dimension for better representation of the bits information to be processed separately. The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z)
CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones. It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers. This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
Streaming Simultaneous Speech Translation with Augmented Memory Transformer [29.248366441276662]
Transformer-based models have achieved state-of-the-art performance on speech translation tasks. We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder.
arXiv Detail & Related papers (2020-10-30T18:28:42Z)
Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies. This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output. This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z)
A Probabilistic Formulation of Unsupervised Text Style Transfer [128.80213211598752]
We present a deep generative model for unsupervised text style transfer that unifies previously proposed non-generative techniques. By hypothesizing a parallel latent sequence that generates each observed sequence, our model learns to transform sequences from one domain to another in a completely unsupervised fashion.
arXiv Detail & Related papers (2020-02-10T16:20:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.