Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation
- URL: http://arxiv.org/abs/2207.06130v1
- Date: Wed, 13 Jul 2022 11:27:46 GMT
- Title: Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation
- Authors: Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun, Xing Xie
- Abstract summary: We propose a novel variational Transformer framework to overcome the KL vanishing problem.
We show that our method can be regarded as entangling latent variables to avoid posterior information decrease through layers.
- Score: 85.5379146125199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The past several years have witnessed Variational Auto-Encoder's superiority
in various text generation tasks. However, due to the sequential nature of the
text, auto-regressive decoders tend to ignore latent variables and thus degenerate into plain language models, a failure known as the KL vanishing problem, which would
further deteriorate when VAE is combined with Transformer-based structures. To
ameliorate this problem, we propose DELLA, a novel variational Transformer
framework. DELLA learns a series of layer-wise latent variables with each
inferred from those of lower layers and tightly coupled with the hidden states
by low-rank tensor product. In this way, DELLA forces these posterior latent
variables to be fused deeply with the whole computation path and hence
incorporate more information. We theoretically demonstrate that our method can
be regarded as entangling latent variables to avoid posterior information
decrease through layers, enabling DELLA to get higher non-zero KL values even
without any annealing or thresholding tricks. Experiments on four unconditional
and three conditional generation tasks show that DELLA could better alleviate
KL vanishing and improve both quality and diversity compared to several strong
baselines.
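To make the abstract concrete, below is a minimal PyTorch sketch (not the authors' released code) of the two ideas it describes: each decoder layer draws a latent variable conditioned on the latent from the layer below, and that latent is coupled back into the layer's hidden states through a low-rank bilinear (tensor-product) interaction. All module names, dimensions, the standard-normal prior, and the use of an unmasked Transformer block are simplifying assumptions for illustration.

```python
# Illustrative sketch only: layer-wise latent inference plus low-rank fusion,
# in the spirit of the abstract above. Not the DELLA reference implementation.
import torch
import torch.nn as nn


class LayerwiseLatent(nn.Module):
    """Infer z_l from the previous layer's latent and this layer's pooled state."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.to_stats = nn.Linear(d_model + d_latent, 2 * d_latent)

    def forward(self, pooled_h, z_prev):
        mu, logvar = self.to_stats(torch.cat([pooled_h, z_prev], dim=-1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL against a standard normal prior (DELLA's prior is itself layer-wise
        # and learned; a fixed N(0, I) is used here only to keep the sketch short).
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl


class LowRankFusion(nn.Module):
    """Couple z with every hidden state via a rank-r bilinear (tensor) product."""

    def __init__(self, d_model: int, d_latent: int, rank: int = 8):
        super().__init__()
        self.proj_h = nn.Linear(d_model, rank)
        self.proj_z = nn.Linear(d_latent, rank)
        self.proj_out = nn.Linear(rank, d_model)

    def forward(self, h, z):
        # Element-wise product of rank-r projections approximates a full
        # bilinear interaction between the hidden states and the latent.
        fused = self.proj_h(h) * self.proj_z(z).unsqueeze(1)
        return h + self.proj_out(fused)


class VariationalLayer(nn.Module):
    """One Transformer block followed by latent inference and fusion."""

    def __init__(self, d_model=256, d_latent=32, n_heads=4):
        super().__init__()
        # An unmasked encoder block stands in for the real decoder block.
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.latent = LayerwiseLatent(d_model, d_latent)
        self.fusion = LowRankFusion(d_model, d_latent)

    def forward(self, h, z_prev):
        h = self.block(h)
        z, kl = self.latent(h.mean(dim=1), z_prev)  # pool over the sequence
        return self.fusion(h, z), z, kl


if __name__ == "__main__":
    d_model, d_latent = 256, 32
    layers = nn.ModuleList([VariationalLayer(d_model, d_latent) for _ in range(3)])
    h = torch.randn(2, 10, d_model)   # (batch, seq, d_model)
    z = torch.zeros(2, d_latent)      # z_0 starts the chain of layer-wise latents
    total_kl = torch.zeros(())
    for layer in layers:
        h, z, kl = layer(h, z)
        total_kl = total_kl + kl      # every layer contributes a KL term
    print(h.shape, total_kl.item())
```

Because every layer contributes a KL term and its latent feeds the next layer's inference, the decoder cannot simply bypass the latent path, which is the intuition behind the non-zero KL values the abstract reports without annealing or thresholding tricks.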
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights.
In this paper, we investigate the effect of explicitly inferring task latents.
We find little discernible difference between the two settings; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z)
- VOLTA: Improving Generative Diversity by Variational Mutual Information Maximizing Autoencoder [38.35049378875308]
We introduce VOLTA, a framework that elevates generative diversity by bridging Transformer with VAE.
We perform comprehensive experiments with two types of Transformers on six datasets to show that our approach can significantly improve generative diversity while maintaining generative quality.
arXiv Detail & Related papers (2023-07-03T08:45:42Z)
- Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation [85.5379146125199]
Variational Auto-Encoder (VAE) has been widely adopted in text generation.
We propose TRACE, a Transformer-based recurrent VAE structure.
arXiv Detail & Related papers (2022-10-22T10:25:35Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $3\times$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- Preventing Posterior Collapse with Levenshtein Variational Autoencoder [61.30283661804425]
We propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse.
We show that Levenshtein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
arXiv Detail & Related papers (2020-04-30T13:27:26Z)