Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
- URL: http://arxiv.org/abs/2509.08753v2
- Date: Mon, 29 Sep 2025 16:17:12 GMT
- Title: Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
- Authors: Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez
- Abstract summary: Delayed Streams Modeling is a flexible formulation for sequence-to-sequence learning. It provides streaming inference of arbitrary output sequences from any input combination.
- Score: 57.708486655254966
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence approaches rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
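The delay idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `PAD` token, the helper names `delay_stream` and `align`, and the example frames are all hypothetical, and a real system would operate on audio codec tokens and text tokens rather than strings. The sketch only shows how shifting one of two time-aligned streams decides which stream acts as the input and which as the delayed output.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def delay_stream(stream: Sequence[str], delay: int, pad: str = PAD) -> List[str]:
    """Shift a stream right by `delay` frames, padding the start."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return [pad] * delay + list(stream)


def align(
    input_stream: Sequence[str],
    output_stream: Sequence[str],
    delay: int,
    pad: str = PAD,
) -> List[Tuple[str, str]]:
    """Pair each input frame with the delayed output frame at the same step."""
    delayed = delay_stream(output_stream, delay, pad)
    steps = max(len(input_stream), len(delayed))
    padded_in = list(input_stream) + [pad] * (steps - len(input_stream))
    padded_out = delayed + [pad] * (steps - len(delayed))
    return list(zip(padded_in, padded_out))


# Toy time-aligned streams: one text slot per audio frame (assumed layout).
audio = ["a0", "a1", "a2", "a3"]
text = ["hi", PAD, "there", PAD]

# ASR-like setup: the text stream is delayed, so each text frame is
# predicted only after the model has seen `delay` further audio frames.
asr_steps = align(audio, text, delay=2)

# TTS-like setup: the roles are swapped and the audio stream is delayed.
tts_steps = align(text, audio, delay=2)

for step in asr_steps:
    print(step)
```

In this sketch the delay plays the role of a fixed look-ahead: a larger value gives the output stream more input context at the cost of higher latency.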
Related papers
- Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a look mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame-by-frame. Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
- Sequential Order-Robust Mamba for Time Series Forecasting [5.265578815577529]
Mamba has emerged as a promising alternative to Transformers, offering near-linear complexity in processing sequential data.
We propose SOR-Mamba, a TS forecasting method that incorporates a regularization strategy to minimize the discrepancy between two embedding vectors generated from data with reversed channel orders.
We also introduce channel correlation modeling (CCM), a pretraining task aimed at preserving correlations between channels from the data space to the latent space in order to enhance the ability to capture CD.
arXiv Detail & Related papers (2024-10-30T18:05:22Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder. The model achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- Streaming Sequence Transduction through Dynamic Compression [52.736991266286196]
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. STAR dynamically segments input streams to create compressed anchor representations, achieving nearly lossless compression (12x) in Automatic Speech Recognition (ASR). STAR demonstrates superior segmentation and latency-quality trade-offs in simultaneous speech-to-text tasks, optimizing latency, memory footprint, and quality.
arXiv Detail & Related papers (2024-02-02T06:31:50Z)
- Masked Audio Generation using a Single Non-Autoregressive Transformer [90.11646612273965]
MAGNeT is a masked generative sequence modeling method that operates directly over several streams of audio tokens.
We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation.
We shed light on the importance of each component of MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling.
arXiv Detail & Related papers (2024-01-09T14:29:39Z)
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results show good performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- A Contextual Latent Space Model: Subsequence Modulation in Melodic Sequence [0.0]
Some generative models for sequences such as music and text allow us to edit only subsequences, given surrounding context sequences.
We propose a contextual latent space model (CLSM) so that users can explore subsequence generation with a sense of direction in the generation space.
A context-informed prior and decoder constitute the generative model of CLSM, and a context position-informed encoder is the inference model.
arXiv Detail & Related papers (2021-11-23T07:51:39Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)