FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
- URL: http://arxiv.org/abs/2502.11128v2
- Date: Wed, 03 Sep 2025 01:59:58 GMT
- Title: FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
- Authors: Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin
- Abstract summary: FELLE is an autoregressive model that integrates language modeling with token-wise flow matching. For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step. FELLE generates continuous-valued tokens hierarchically, conditioned on the language model's output.
- Score: 56.30231216917128
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
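To make the abstract's two mechanisms concrete, the sketch below illustrates token-wise flow matching with a dynamic prior and a coarse-to-fine split. It is a minimal illustration under assumed shapes and names, not the authors' implementation: TokenFlowHead, fm_loss, SIGMA, and the band-averaging used for the coarse pass are all hypothetical choices for the example.

```python
import torch
import torch.nn as nn

MEL_DIM, LM_DIM, SIGMA = 80, 1024, 0.5  # assumed mel/LM sizes and prior noise scale

class TokenFlowHead(nn.Module):
    """Hypothetical velocity-field predictor for one mel frame,
    conditioned on the LM hidden state and the flow time t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MEL_DIM + LM_DIM + 1, 512), nn.SiLU(),
            nn.Linear(512, MEL_DIM),
        )

    def forward(self, x_t, h_lm, t):
        return self.net(torch.cat([x_t, h_lm, t], dim=-1))

def fm_loss(head, x1, x_prev, h_lm):
    """Conditional flow-matching loss for one continuous token. Instead of
    a fixed N(0, I) prior, the source sample x0 is drawn around the previous
    frame x_prev (the 'information from the previous step' in the abstract)."""
    x0 = x_prev + SIGMA * torch.randn_like(x_prev)  # dynamic, token-wise prior
    t = torch.rand(x1.size(0), 1)                   # flow time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                     # linear probability path
    v_target = x1 - x0                              # target velocity along the path
    return ((head(x_t, h_lm, t) - v_target) ** 2).mean()

# Coarse-to-fine: one head models a coarse (here: band-averaged) spectrogram,
# a second head models the residual detail; both condition on the LM output.
coarse_head, fine_head = TokenFlowHead(), TokenFlowHead()
x1 = torch.randn(8, MEL_DIM)      # ground-truth mel frame (dummy batch of 8)
x_prev = torch.randn(8, MEL_DIM)  # previously generated mel frame
h_lm = torch.randn(8, LM_DIM)     # LM hidden state for the current step
x_coarse = x1.view(8, MEL_DIM // 4, 4).mean(-1).repeat_interleave(4, dim=-1)
loss = (fm_loss(coarse_head, x_coarse, x_prev, h_lm)
        + fm_loss(fine_head, x1 - x_coarse, torch.zeros_like(x1), h_lm))
loss.backward()
```

At inference, under the same assumptions, one would sample x0 around the previous frame and integrate the learned velocity field (e.g. with a few Euler steps), running the coarse head before the fine head.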
Related papers
- Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z)
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
- Model-Aware Tokenizer Transfer [46.13517417540154]
Model-Aware Tokenizer Transfer (MATT) is a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model. Experiments show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming baselines.
arXiv Detail & Related papers (2025-10-24T18:27:36Z)
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z)
- Visual Self-Refinement for Autoregressive Models [27.0373357661741]
This work proposes a plug-and-play refinement module to enhance complex spatial correspondence modeling. Experiments demonstrate that the proposed method improves generation quality, enhancing the model's ability to produce semantically consistent results.
arXiv Detail & Related papers (2025-10-01T15:03:32Z)
- Unified Flow Matching for Long Horizon Event Forecasting [3.0639815065447036]
We propose a unified flow matching framework for marked temporal point processes. By learning continuous-time flows for both components, our method generates coherent long-horizon event trajectories without sequential decoding. We evaluate our model on six real-world benchmarks and demonstrate significant improvements over autoregressive and diffusion-based baselines in both accuracy and generation efficiency.
arXiv Detail & Related papers (2025-08-06T19:42:49Z)
- Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production [0.0]
We introduce a hybrid approach combining autoregressive and diffusion models for Sign Language Production (SLP). To capture fine-grained body movements, we design a Multi-Scale Pose Representation module that separately extracts detailed features from distinct articulators. We also introduce a Confidence-Aware Causal Attention mechanism that utilizes joint-level confidence scores to dynamically guide the pose generation process.
arXiv Detail & Related papers (2025-07-12T01:34:50Z)
- Transition Matching: Scalable and Flexible Generative Modeling [36.605030979361516]
This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes.
arXiv Detail & Related papers (2025-06-30T07:51:58Z)
- Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z)
- TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction [5.925383490825323]
Vision-language models (VLMs) have achieved remarkable advancements, capitalizing on the impressive capabilities of large language models (LLMs).
Despite this, a critical challenge known as hallucination occurs when models overconfidently describe objects or attributes absent from the image.
This limitation reduces model reliability in high-stakes applications.
arXiv Detail & Related papers (2025-03-06T14:11:00Z)
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens [53.99177152562075]
Scaling up autoregressive models in vision has not proven as beneficial as in large language models.
We focus on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed order using BERT- or GPT-like transformer architectures.
Our results show that while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends.
arXiv Detail & Related papers (2024-10-17T17:59:59Z)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- CaLMFlow: Volterra Flow Matching using Causal Language Models [14.035963716966787]
CaLMFlow is a framework that casts flow matching as a Volterra integral equation (VIE).
Our method implements tokenization across space and time, thereby solving a VIE over these domains.
We demonstrate CaLMFlow's effectiveness on synthetic and real-world data, including single-cell perturbation response prediction.
arXiv Detail & Related papers (2024-10-03T05:07:41Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, with a significant speedup in computation compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Attentive Contractive Flow with Lipschitz-constrained Self-Attention [25.84621883831624]
We introduce a novel approach called Attentive Contractive Flow (ACF).
ACF utilizes a special category of flow-based generative models: contractive flows.
We demonstrate that ACF can be introduced into a variety of state-of-the-art flow models in a plug-and-play manner.
arXiv Detail & Related papers (2021-09-24T18:02:49Z)