FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
- URL: http://arxiv.org/abs/2502.11128v2
- Date: Wed, 03 Sep 2025 01:59:58 GMT
- Title: FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
- Authors: Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin
- Abstract summary: FELLE is an autoregressive model that integrates language modeling with token-wise flow matching. For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step. FELLE generates continuous-valued tokens hierarchically, conditioned on the language model's output.
- Score: 56.30231216917128
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
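To make the abstract's two mechanisms concrete, the sketch below illustrates token-wise flow matching with a dynamic prior and a coarse-to-fine split. It is a minimal illustration under assumed shapes and names, not the authors' implementation: TokenFlowHead, fm_loss, SIGMA, and the band-averaging used for the coarse pass are all hypothetical choices for the example.

```python
import torch
import torch.nn as nn

MEL_DIM, LM_DIM, SIGMA = 80, 1024, 0.5  # assumed mel/LM sizes and prior noise scale

class TokenFlowHead(nn.Module):
    """Hypothetical velocity-field predictor for one mel frame,
    conditioned on the LM hidden state and the flow time t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MEL_DIM + LM_DIM + 1, 512), nn.SiLU(),
            nn.Linear(512, MEL_DIM),
        )

    def forward(self, x_t, h_lm, t):
        return self.net(torch.cat([x_t, h_lm, t], dim=-1))

def fm_loss(head, x1, x_prev, h_lm):
    """Conditional flow-matching loss for one continuous token. Instead of
    a fixed N(0, I) prior, the source sample x0 is drawn around the previous
    frame x_prev (the 'information from the previous step' in the abstract)."""
    x0 = x_prev + SIGMA * torch.randn_like(x_prev)  # dynamic, token-wise prior
    t = torch.rand(x1.size(0), 1)                   # flow time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                     # linear probability path
    v_target = x1 - x0                              # target velocity along the path
    return ((head(x_t, h_lm, t) - v_target) ** 2).mean()

# Coarse-to-fine: one head models a coarse (here: band-averaged) spectrogram,
# a second head models the residual detail; both condition on the LM output.
coarse_head, fine_head = TokenFlowHead(), TokenFlowHead()
x1 = torch.randn(8, MEL_DIM)      # ground-truth mel frame (dummy batch of 8)
x_prev = torch.randn(8, MEL_DIM)  # previously generated mel frame
h_lm = torch.randn(8, LM_DIM)     # LM hidden state for the current step
x_coarse = x1.view(8, MEL_DIM // 4, 4).mean(-1).repeat_interleave(4, dim=-1)
loss = (fm_loss(coarse_head, x_coarse, x_prev, h_lm)
        + fm_loss(fine_head, x1 - x_coarse, torch.zeros_like(x1), h_lm))
loss.backward()
```

At inference, under the same assumptions, one would sample x0 around the previous frame and integrate the learned velocity field (e.g. with a few Euler steps), running the coarse head before the fine head.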
Related papers
- Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z)
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
- Model-Aware Tokenizer Transfer [46.13517417540154]
Model-Aware Tokenizer Transfer (MATT) is a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model. Experiments show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming baselines.
arXiv Detail & Related papers (2025-10-24T18:27:36Z)
- Context-level Language Modeling by Learning Predictive Context Embeddings [79.00607069677393]
We introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance.
arXiv Detail & Related papers (2025-10-23T07:09:45Z)
- Visual Self-Refinement for Autoregressive Models [27.0373357661741]
This work proposes a plug-and-play refinement module to enhance complex spatial correspondence modeling. Experiments demonstrate that the proposed method improves generation quality, enhancing the model's ability to produce semantically consistent results.
arXiv Detail & Related papers (2025-10-01T15:03:32Z)
- Unified Flow Matching for Long Horizon Event Forecasting [3.0639815065447036]
We propose a unified flow matching framework for marked temporal point processes. By learning continuous-time flows for both components, our method generates coherent long-horizon event trajectories without sequential decoding. We evaluate our model on six real-world benchmarks and demonstrate significant improvements over autoregressive and diffusion-based baselines in both accuracy and generation efficiency.
arXiv Detail & Related papers (2025-08-06T19:42:49Z)
- Hybrid Autoregressive-Diffusion Model for Real-Time Streaming Sign Language Production [0.0]
We introduce a hybrid approach combining autoregressive and diffusion models for Sign Language Production (SLP). To capture fine-grained body movements, we design a Multi-Scale Pose Representation module that separately extracts detailed features from distinct articulators. We also introduce a Confidence-Aware Causal Attention mechanism that utilizes joint-level confidence scores to dynamically guide the pose generation process.
arXiv Detail & Related papers (2025-07-12T01:34:50Z)
- Transition Matching: Scalable and Flexible Generative Modeling [36.605030979361516]
This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes.
arXiv Detail & Related papers (2025-06-30T07:51:58Z)
- Frequency Autoregressive Image Generation with Continuous Tokens [31.833852108014312]
We introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with the continuous tokenizer.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset.
arXiv Detail & Related papers (2025-03-07T10:34:04Z)
- TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction [5.925383490825323]
Vision-language models (VLMs) have achieved remarkable advancements, capitalizing on the impressive capabilities of large language models (LLMs).
Despite this, a critical challenge known as hallucination occurs when models overconfidently describe objects or attributes absent from the image.
This limitation reduces model reliability in high-stakes applications.
arXiv Detail & Related papers (2025-03-06T14:11:00Z)
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens [53.99177152562075]
Scaling up autoregressive models in vision has not proven as beneficial as in large language models.
We focus on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed order using BERT- or GPT-like transformer architectures.
Our results show that while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends.
arXiv Detail & Related papers (2024-10-17T17:59:59Z)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- CaLMFlow: Volterra Flow Matching using Causal Language Models [14.035963716966787]
CaLMFlow is a framework that casts flow matching as a Volterra integral equation (VIE).
Our method implements tokenization across space and time, thereby solving a VIE over these domains.
We demonstrate CaLMFlow's effectiveness on synthetic and real-world data, including single-cell perturbation response prediction.
arXiv Detail & Related papers (2024-10-03T05:07:41Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, with a significant speedup in computation compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Attentive Contractive Flow with Lipschitz-constrained Self-Attention [25.84621883831624]
We introduce a novel approach called Attentive Contractive Flow (ACF).
ACF utilizes a special category of flow-based generative models: contractive flows.
We demonstrate that ACF can be introduced into a variety of state-of-the-art flow models in a plug-and-play manner.
arXiv Detail & Related papers (2021-09-24T18:02:49Z)