Related papers: Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

URL: http://arxiv.org/abs/2509.08753v2
Date: Mon, 29 Sep 2025 16:17:12 GMT
Title: Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
Authors: Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez
Abstract summary: Delayed Streams Modeling is a flexible formulation for sequence-to-sequence learning. It provides streaming inference of arbitrary output sequences from any input combination.
Score: 57.708486655254966
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
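
The delay idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `shift` and `interleave`, the `<pad>` symbol, and the string tokens are all hypothetical, and the sketch assumes the two streams have already been aligned to the same frame rate. It only shows how delaying one stream relative to the other changes which stream acts as the input at each step.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding symbol, not taken from the paper


def shift(stream: Sequence[str], delay: int, total: int) -> List[str]:
    """Shift a stream right by `delay` steps and pad it to `total` steps."""
    out = [PAD] * delay + list(stream)
    return out + [PAD] * (total - len(out))


def interleave(
    audio: Sequence[str],
    text: Sequence[str],
    audio_delay: int = 0,
    text_delay: int = 0,
) -> List[Tuple[str, str]]:
    """Pair two time-aligned streams step by step after delaying each one.

    Returns one (audio, text) tuple per time step.
    """
    if len(audio) != len(text):
        raise ValueError("streams must be aligned to the same length")
    if audio_delay < 0 or text_delay < 0:
        raise ValueError("delays must be non-negative")
    total = len(audio) + max(audio_delay, text_delay)
    return list(zip(shift(audio, audio_delay, total),
                    shift(text, text_delay, total)))


audio = ["a0", "a1", "a2", "a3"]
text = ["t0", "t1", "t2", "t3"]

# ASR-like layout: the text stream is delayed, so each text token is
# emitted only after the matching audio frames have been seen.
asr_steps = interleave(audio, text, text_delay=2)

# TTS-like layout: the audio stream is delayed instead.
tts_steps = interleave(audio, text, audio_delay=1)
```

In the ASR-like layout, step 2 pairs audio frame `a2` with text token `t0`, so the text lags the audio by two steps; swapping which stream is delayed gives the TTS-like layout.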

Related papers

Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a look mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame-by-frame. Experimental results suggest that the LE outperforms current streaming TTS methods and achieves comparable performance over sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
Sequential Order-Robust Mamba for Time Series Forecasting [5.265578815577529]
Mamba has emerged as a promising alternative to Transformers, offering near-linear complexity in processing sequential data. We propose SOR-Mamba, a TS forecasting method that incorporates a regularization strategy to minimize the discrepancy between two embedding vectors generated from data with reversed channel orders. We also introduce channel correlation modeling (CCM), a pretraining task aimed at preserving correlations between channels from the data space to the latent space in order to enhance the ability to capture CD.
arXiv Detail & Related papers (2024-10-30T18:05:22Z)
Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder. The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
Streaming Sequence Transduction through Dynamic Compression [52.736991266286196]
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR). STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
arXiv Detail & Related papers (2024-02-02T06:31:50Z)
Masked Audio Generation using a Single Non-Autoregressive Transformer [90.11646612273965]
MAGNeT is a masked generative sequence modeling method that operates directly over several streams of audio tokens. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation. We shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling.
arXiv Detail & Related papers (2024-01-09T14:29:39Z)
SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation. We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation. Experiment results illustrate the good performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
A Contextual Latent Space Model: Subsequence Modulation in Melodic Sequence [0.0]
Some generative models for sequences such as music and text allow us to edit only subsequences, given surrounding context sequences. We propose a contextual latent space model (CLSM) in order for users to be able to explore subsequence generation with a sense of direction in the generation space. A context-informed prior and decoder constitute the generative model of CLSM, and a context position-informed is the inference model.
arXiv Detail & Related papers (2021-11-23T07:51:39Z)
Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.