Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms
- URL: http://arxiv.org/abs/2503.07154v2
- Date: Tue, 11 Mar 2025 16:52:41 GMT
- Title: Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms
- Authors: Jiaming Song, Linqi Zhou
- Abstract summary: We argue that an inference-first perspective can inspire novel generative pre-training algorithms. We show how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm.
- Score: 35.74919627230777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen significant advancements in foundation models through generative pre-training, yet algorithmic innovation in this space has largely stagnated around autoregressive models for discrete signals and diffusion models for continuous signals. This stagnation creates a bottleneck that prevents us from fully unlocking the potential of rich multi-modal data, which in turn limits the progress on multimodal intelligence. We argue that an inference-first perspective, which prioritizes scaling efficiency during inference time across sequence length and refinement steps, can inspire novel generative pre-training algorithms. Using Inductive Moment Matching (IMM) as a concrete example, we demonstrate how addressing limitations in diffusion models' inference process through targeted modifications yields a stable, single-stage algorithm that achieves superior sample quality with over an order of magnitude greater inference efficiency.
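The claimed "over an order of magnitude greater inference efficiency" is most naturally read in terms of the number of model evaluations (NFEs) needed per sample. The sketch below is a minimal illustration of that accounting under assumed step counts, with a placeholder `denoise_step` standing in for a learned model call; it is not the IMM algorithm itself.

```python
# Hypothetical sketch: "inference efficiency" counted as network function
# evaluations (NFEs) per sample. `denoise_step` is a stand-in for one model
# call, not the IMM method; only the step-count accounting is the point.
import numpy as np

def denoise_step(x, t_from, t_to, rng):
    # Placeholder for one model evaluation mapping a noisier sample at
    # time t_from to a less noisy sample at time t_to.
    return x * (t_to / max(t_from, 1e-8)) + rng.normal(scale=1e-3, size=x.shape)

def sample(x_T, timesteps, rng):
    """Iterative sampler over a timestep schedule; NFE == len(timesteps) - 1."""
    x = x_T
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):
        x = denoise_step(x, t_from, t_to, rng)
    return x, len(timesteps) - 1

rng = np.random.default_rng(0)
x_T = rng.normal(size=(4,))
_, nfe_many = sample(x_T, np.linspace(1.0, 0.0, 251), rng)  # typical many-step diffusion budget
_, nfe_few = sample(x_T, np.linspace(1.0, 0.0, 9), rng)     # few-step regime targeted by inference-first designs
print(nfe_many, nfe_few, nfe_many / nfe_few)                 # e.g. 250 vs 8: >30x fewer evaluations
```

A 250-step sampler versus an 8-step one already accounts for a >30x gap in inference cost, which is the kind of budget gap an inference-first design targets.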
Related papers
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance.
Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws.
We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps [48.16416920913577]
We explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps.
We consider a search problem aimed at identifying better noises for the diffusion sampling process.
Our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models.
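One concrete way to spend extra inference compute under this "search over noises" framing is a best-of-N search: sample several candidate initial noises, denoise each, and keep the output a verifier scores highest. The sketch below is an assumed illustration of that idea; `sampler` and `verifier` are hypothetical stand-ins, not the paper's components.

```python
# Hedged sketch of noise search: more inference compute = more candidate
# starting noises, scored by a verifier. `sampler` and `verifier` are
# placeholders for a trained denoiser and a reward/quality model.
import numpy as np

def search_over_noises(sampler, verifier, shape, n_candidates, seed=0):
    rng = np.random.default_rng(seed)
    best_sample, best_score = None, -np.inf
    for _ in range(n_candidates):          # inference-time scaling knob
        noise = rng.normal(size=shape)      # candidate starting noise x_T
        sample = sampler(noise)             # ordinary denoising from this noise
        score = verifier(sample)            # e.g. an aesthetic or reward model
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score

# Toy usage with stand-in sampler/verifier:
out, score = search_over_noises(sampler=lambda z: z * 0.1,
                                verifier=lambda x: -float(np.abs(x).mean()),
                                shape=(8,), n_candidates=16)
```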
arXiv Detail & Related papers (2025-01-16T18:30:37Z)
- Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence [11.400431211239958]
Diffusion models have emerged as powerful tools for generative modeling.
We propose a control framework for fine-tuning diffusion models.
We show that PI-FT achieves global convergence at a linear rate.
arXiv Detail & Related papers (2024-12-24T04:55:46Z)
- Adaptive Non-Uniform Timestep Sampling for Diffusion Model Training [4.760537994346813]
As data distributions grow more complex, training diffusion models to convergence becomes increasingly compute-intensive.
We introduce a non-uniform timestep sampling method that prioritizes the more critical timesteps.
Our method shows robust performance across various datasets, scheduling strategies, and diffusion architectures.
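A minimal version of non-uniform timestep sampling can be sketched as follows, assuming the "critical" timesteps are identified by their running training loss; this illustrates the general idea rather than the paper's exact weighting scheme.

```python
# Hedged sketch: keep a running loss estimate per diffusion timestep and
# sample training timesteps in proportion to it, so harder timesteps are
# visited more often. Not the paper's exact scheme.
import numpy as np

class AdaptiveTimestepSampler:
    def __init__(self, num_timesteps, smoothing=0.9):
        self.loss_ema = np.ones(num_timesteps)   # running per-timestep loss estimates
        self.smoothing = smoothing

    def sample(self, batch_size, rng):
        probs = self.loss_ema / self.loss_ema.sum()
        return rng.choice(len(self.loss_ema), size=batch_size, p=probs)

    def update(self, timesteps, losses):
        for t, l in zip(timesteps, losses):      # EMA update from observed training losses
            self.loss_ema[t] = self.smoothing * self.loss_ema[t] + (1 - self.smoothing) * l

rng = np.random.default_rng(0)
sampler = AdaptiveTimestepSampler(num_timesteps=1000)
ts = sampler.sample(batch_size=4, rng=rng)
sampler.update(ts, losses=rng.uniform(size=len(ts)))
```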
arXiv Detail & Related papers (2024-11-15T07:12:18Z)
- Improved Noise Schedule for Diffusion Training [51.849746576387375]
We propose a novel approach to design the noise schedule for enhancing the training of diffusion models.
We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule.
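For reference, the baseline here is the standard cosine schedule (Nichol & Dhariwal, 2021). The sketch below computes it plus an illustrative logSNR-shifted variant to show what "designing the noise schedule" means in practice; the paper's actual improved schedule is not reproduced.

```python
# Baseline cosine schedule and a hypothetical logSNR-shifted variant,
# shown only to illustrate schedule design; not the paper's schedule.
import numpy as np

def cosine_alpha_bar(t, s=0.008):
    """Standard cosine schedule: cumulative signal level alpha_bar(t), t in [0, 1]."""
    return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2

def shifted_logsnr(t, shift=1.0, s=0.008):
    """Illustrative reshaping: shift the log signal-to-noise ratio of the cosine schedule."""
    a = cosine_alpha_bar(t, s)
    logsnr = np.log(a / (1 - a + 1e-12)) + np.log(shift)
    return 1.0 / (1.0 + np.exp(-logsnr))     # back to alpha_bar form

t = np.linspace(0.0, 1.0, 5)
print(cosine_alpha_bar(t))
print(shifted_logsnr(t, shift=0.5))          # lower shift = noisier schedule overall
```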
arXiv Detail & Related papers (2024-07-03T17:34:55Z)
- Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels.
We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
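The distinguishing training-time step is the corruption: each token receives an independently sampled noise level instead of one level shared across the whole sequence. Below is a hedged sketch of that corruption under an assumed cosine-style schedule; the causal model and the denoising loss are omitted.

```python
# Hedged sketch of per-token corruption: every token in the sequence gets
# its own independently sampled noise level. The model (not shown) would be
# trained to denoise each token given its noisy causal context. Shapes and
# the schedule are illustrative.
import numpy as np

def corrupt_per_token(tokens, num_levels, rng):
    """tokens: (seq_len, dim) continuous token embeddings."""
    seq_len, _ = tokens.shape
    k = rng.integers(0, num_levels, size=seq_len)          # independent level per token
    alpha_bar = np.cos(k / num_levels * np.pi / 2) ** 2     # any monotone schedule works here
    eps = rng.normal(size=tokens.shape)
    noisy = np.sqrt(alpha_bar)[:, None] * tokens + np.sqrt(1 - alpha_bar)[:, None] * eps
    return noisy, k, eps                                    # targets for a denoising loss

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))                               # toy sequence of 16 tokens
noisy_x, levels, eps = corrupt_per_token(x, num_levels=1000, rng=rng)
```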
arXiv Detail & Related papers (2024-07-01T15:43:25Z)
- Variational quantization for state space models [3.9762742923544456]
Forecasting over large datasets that gather thousands of heterogeneous time series is a crucial statistical problem in numerous sectors.
We propose a new forecasting model that combines discrete state space hidden Markov models with recent neural network architectures and training procedures inspired by vector quantized variational autoencoders.
We assess the performance of the proposed method using several datasets and show that it outperforms other state-of-the-art solutions.
arXiv Detail & Related papers (2024-04-17T07:01:41Z)
- MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process [26.661721555671626]
We introduce a novel Multi-Granularity Time Series Diffusion (MG-TSD) model, which achieves state-of-the-art predictive performance.
Our approach does not rely on additional external data, making it versatile and applicable across various domains.
arXiv Detail & Related papers (2024-03-09T01:15:03Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
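The latent-variable reading of chain-of-thought can be written out explicitly (the notation below is ours, used only to illustrate why the posterior is intractable): the reasoning chain z sits between question x and answer y.

```latex
% Chain-of-thought as a latent variable z between question x and answer y
% (our notation; shown to illustrate why exact posterior sampling is hard).
\begin{align}
  p_\theta(y \mid x) &= \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z), \\
  p_\theta(z \mid x, y) &= \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}
                               {\sum_{z'} p_\theta(z' \mid x)\, p_\theta(y \mid x, z')}.
\end{align}
```

The denominator enumerates every possible chain of thought, so exact posterior sampling is infeasible; the fine-tuning described above instead trains the model as an amortized sampler whose distribution matches this posterior.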
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), each have well-known drawbacks: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
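A rough sketch of the described data flow, with every module replaced by a stub: the sequence encoder summarizes the interaction history, and the denoising decoder is applied step-wise to a noised target-item representation while conditioning on that summary. This is an assumed reading of the component names above, not the paper's concrete architecture.

```python
# Hedged sketch of the described data flow; all modules are stubs.
import numpy as np

def recommend(history_emb, denoise_decoder, num_steps, dim, rng):
    h = history_emb.mean(axis=0)             # stand-in "sequence encoder" summary
    x = rng.normal(size=dim)                 # start from noise for the target item
    for step in reversed(range(num_steps)):  # step-wise reverse diffusion
        x = denoise_decoder(x, h, step)      # decoder conditions on the history summary
    return x                                 # denoised target-item embedding

rng = np.random.default_rng(0)
history = rng.normal(size=(20, 64))              # 20 past interactions, 64-d embeddings
stub_decoder = lambda x, h, t: 0.9 * x + 0.1 * h  # placeholder for the learned decoder
item_emb = recommend(history, stub_decoder, num_steps=50, dim=64, rng=rng)
```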
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.