Mode recovery in neural autoregressive sequence modeling
- URL: http://arxiv.org/abs/2106.05459v1
- Date: Thu, 10 Jun 2021 02:17:28 GMT
- Title: Mode recovery in neural autoregressive sequence modeling
- Authors: Ilia Kulikov, Sean Welleck, Kyunghyun Cho
- Abstract summary: Recent studies have revealed unexpected and undesirable properties of neural autoregressive sequence models.
We investigate how the modes, or local maxima, of a distribution are maintained throughout the full learning chain.
We conclude that future research must consider the entire learning chain in order to fully understand the potentials and perils.
- Score: 55.05526174291747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite its wide use, recent studies have revealed unexpected and undesirable
properties of neural autoregressive sequence models trained with maximum
likelihood, such as an unreasonably high affinity to short sequences after
training and to infinitely long sequences at decoding time. We propose to study
these phenomena by investigating how the modes, or local maxima, of a
distribution are maintained throughout the full learning chain of the
ground-truth, empirical, learned and decoding-induced distributions, via the
newly proposed mode recovery cost. We design a tractable testbed where we build
three types of ground-truth distributions: (1) an LSTM based structured
distribution, (2) an unstructured distribution where probability of a sequence
does not depend on its content, and (3) a product of these two which we call a
semi-structured distribution. Our study reveals both expected and unexpected
findings. First, starting with data collection, mode recovery cost strongly
relies on the ground-truth distribution and is most costly with the
semi-structured distribution. Second, after learning, mode recovery cost from
the ground-truth distribution may increase or decrease compared to data
collection, with the largest cost degradation occurring with the
semi-structured ground-truth distribution. Finally, the ability of the
decoding-induced distribution to recover modes from the learned distribution is
highly impacted by the choices made earlier in the learning chain. We conclude
that future research must consider the entire learning chain in order to fully
understand the potentials and perils and to further improve neural
autoregressive sequence models.
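The abstract names the mode recovery cost and the three kinds of ground-truth distribution without defining them. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the cost from p to q at level k is the smallest k' such that the k' most probable sequences under q contain the k most probable sequences under p, and it uses hand-written toy dictionaries in place of the LSTM-based distribution. The names `top_k`, `mode_recovery_cost`, and `normalized_product`, and all probability values, are hypothetical.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # maps a sequence (as a string) to its probability


def top_k(dist: Dist, k: int) -> List[str]:
    """Return the k most probable sequences (ties keep insertion order)."""
    return sorted(dist, key=lambda s: -dist[s])[:k]


def mode_recovery_cost(p: Dist, q: Dist, k: int) -> float:
    """Assumed definition: smallest k' whose top-k' set under q covers the
    top-k set under p. Returns infinity if q gives a mode of p no support."""
    targets = set(top_k(p, k))
    ranked_q = [s for s in sorted(q, key=lambda s: -q[s]) if q[s] > 0]
    found = 0
    for rank, seq in enumerate(ranked_q, start=1):
        if seq in targets:
            found += 1
            if found == len(targets):
                return rank
    return math.inf


def normalized_product(a: Dist, b: Dist) -> Dist:
    """Product of two distributions over the same support, renormalized."""
    raw = {s: a[s] * b[s] for s in a}
    total = sum(raw.values())
    return {s: v / total for s, v in raw.items()}


# Toy stand-ins (hypothetical values, not from the paper).
structured = {"a b": 0.5, "a": 0.3, "b a": 0.15, "b": 0.05}
unstructured = {"a b": 0.1, "a": 0.2, "b a": 0.3, "b": 0.4}
semi_structured = normalized_product(structured, unstructured)

# An "empirical" distribution from a small sample that never saw "b a".
empirical = {"a": 0.5, "b": 0.3, "a b": 0.2}

cost_k2 = mode_recovery_cost(semi_structured, empirical, k=2)
cost_k3 = mode_recovery_cost(semi_structured, empirical, k=3)
```

In this toy setup the two top modes of the semi-structured distribution are recovered only after looking at three sequences of the empirical distribution, and the third mode cannot be recovered at all because it never appears in the sample.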
Related papers
- Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training [11.253812961752958]
Generative Artificial Intelligence (AI) has become a transformative force across science, industry, and society.
As these systems grow in popularity, web data becomes increasingly interwoven with this AI-generated material.
As generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions.
arXiv Detail & Related papers (2026-02-17T22:38:18Z)
- Learning a Generative Meta-Model of LLM Activations [75.30161960337892]
We create "meta-models" that learn the distribution of a network's internal states.
Applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases.
These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions.
arXiv Detail & Related papers (2026-02-06T18:59:56Z)
- Residual Prior Diffusion: A Probabilistic Framework Integrating Coarse Latent Priors with Diffusion Models [0.5753274939310764]
Residual Prior Diffusion (RPD) is a two-stage framework in which a coarse prior model first captures the large-scale structure of the data distribution.
RPD accurately captures fine-scale detail while preserving the large-scale structure of the distribution.
On natural image generation tasks, RPD achieved generation quality that matched or exceeded that of representative diffusion-based baselines.
arXiv Detail & Related papers (2025-12-25T09:19:10Z)
- Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling [0.7537475180985093]
Foundation models refer to architectures trained on vast datasets using autoregressive pre-training to capture intricate patterns and motifs.
We trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets.
Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete.
Both reproduced feature distributions but failed to preserve cross-feature structure.
arXiv Detail & Related papers (2025-10-27T00:04:17Z)
- Learning Robust Diffusion Models from Imprecise Supervision [75.53546939251146]
DMIS is a unified framework for training robust Conditional Diffusion Models from Imprecise Supervision.
Our framework is derived from likelihood and decomposes the objective into generative and classification components.
Experiments on diverse forms of imprecise supervision, covering image generation, weakly supervised learning, and dataset condensation, demonstrate that DMIS consistently produces high-quality and class-discriminative samples.
arXiv Detail & Related papers (2025-10-03T14:00:32Z)
- Are you SURE? Enhancing Multimodal Pretraining with Missing Modalities through Uncertainty Estimation [12.459901557580052]
We present SURE, a novel framework that extends the capabilities of pretrained multimodal models by introducing latent space reconstruction and uncertainty estimation.
We show that SURE consistently achieves state-of-the-art performance, ensuring robust predictions even in the presence of incomplete data.
arXiv Detail & Related papers (2025-04-18T05:07:20Z)
- Parallelly Tempered Generative Adversarial Networks [7.94957965474334]
A generative adversarial network (GAN) has been a representative backbone model in generative artificial intelligence (AI).
This work analyzes the training instability and inefficiency in the presence of mode collapse by linking it to multimodality in the target distribution.
With our newly developed GAN objective function, the generator can learn all the tempered distributions simultaneously.
arXiv Detail & Related papers (2024-11-18T18:01:13Z)
- Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in a training dataset.
We develop constrained diffusion models by imposing diffusion constraints based on desired distributions.
We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- Unimodal Distributions for Ordinal Regression [2.642698101441705]
We propose two new approaches to incorporate the preference for unimodal distributions into the predictive model.
We analyse the set of unimodal distributions in the probability simplex and establish fundamental properties.
We then propose a new architecture that imposes unimodal distributions and a new loss term that relies on the notion of projection in a set to promote unimodality.
arXiv Detail & Related papers (2023-03-08T13:00:40Z)
- JANA: Jointly Amortized Neural Approximation of Complex Bayesian Models [0.5872014229110214]
We propose "jointly amortized neural approximation" (JANA) of intractable likelihood functions and posterior densities.
We benchmark the fidelity of JANA on a variety of simulation models against state-of-the-art Bayesian methods.
arXiv Detail & Related papers (2023-02-17T20:17:21Z)
- Distributional Reinforcement Learning via Moment Matching [54.16108052278444]
We formulate a method that learns a finite set of statistics from each return distribution via neural networks.
Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target.
Experiments on the suite of Atari games show that our method outperforms the standard distributional RL baselines.
arXiv Detail & Related papers (2020-07-24T05:18:17Z)
- MMCGAN: Generative Adversarial Network with Explicit Manifold Prior [78.58159882218378]
We propose to employ explicit manifold learning as prior to alleviate mode collapse and stabilize training of GAN.
Our experiments on both the toy data and real datasets show the effectiveness of MMCGAN in alleviating mode collapse, stabilizing training, and improving the quality of generated samples.
arXiv Detail & Related papers (2020-06-18T07:38:54Z)