Do sequence-to-sequence VAEs learn global features of sentences?
- URL: http://arxiv.org/abs/2004.07683v2
- Date: Sun, 28 Mar 2021 18:59:31 GMT
- Title: Do sequence-to-sequence VAEs learn global features of sentences?
- Authors: Tom Bosc and Pascal Vincent
- Abstract summary: We study the Variational Autoencoder (VAE) for natural language with the sequence-to-sequence architecture.
We find that VAEs are prone to memorizing the first words and the sentence length, producing local features of limited usefulness.
Alternative architectures based on bag-of-words assumptions and language model pretraining learn latent variables that are more global, i.e., more predictive of topic or sentiment labels.
- Score: 13.43800646539014
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autoregressive language models are powerful and relatively easy to train.
However, these models are usually trained without explicit conditioning labels
and do not offer easy ways to control global aspects such as sentiment or topic
during generation. Bowman et al. (2016) adapted the Variational Autoencoder
(VAE) for natural language with the sequence-to-sequence architecture and
claimed that the latent vector was able to capture such global features in an
unsupervised manner. We question this claim. We measure which words benefit
most from the latent information by decomposing the reconstruction loss per
position in the sentence. Using this method, we find that VAEs are prone to
memorizing the first words and the sentence length, producing local features of
limited usefulness. To alleviate this, we investigate alternative architectures
based on bag-of-words assumptions and language model pretraining. These
variants learn latent variables that are more global, i.e., more predictive of
topic or sentiment labels. Moreover, using reconstructions, we observe that
they decrease memorization: the first word and the sentence length are not
recovered as accurately as with the baselines, consequently yielding more
diverse reconstructions.
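The per-position decomposition of the reconstruction loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes that per-token log-probabilities are already available from two decoders: one conditioned on the latent vector and one without it. The comparison against an unconditioned decoder, the function names, and the numbers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def per_position_nll(token_log_probs):
    """Average negative log-likelihood at each position over a batch.

    token_log_probs: list of sentences, each a list of log p(x_t | ...) values.
    Sentences may differ in length; each position is averaged over the
    sentences that are long enough to reach it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentence in token_log_probs:
        for position, log_prob in enumerate(sentence):
            totals[position] -= log_prob
            counts[position] += 1
    return [totals[p] / counts[p] for p in sorted(totals)]


def latent_gain_per_position(with_latent, without_latent):
    """Per-position NLL reduction when the decoder sees the latent vector.

    Assumption: both inputs cover the same sentences, so positions align.
    """
    nll_z = per_position_nll(with_latent)
    nll_lm = per_position_nll(without_latent)
    return [lm - z for z, lm in zip(nll_z, nll_lm)]


# Hypothetical log-probabilities for two sentences (made-up numbers).
with_latent = [[-1.0, -2.0, -3.0], [-1.0, -2.0]]
without_latent = [[-4.0, -2.5, -3.0], [-3.0, -2.5]]

gains = latent_gain_per_position(with_latent, without_latent)
print(gains)  # [2.5, 0.5, 0.0]
```

In this toy input the gain is concentrated at position 0, which is the kind of pattern the abstract describes as memorization of the first words.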
Related papers
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases [28.58785395946639]
We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning.
We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
arXiv Detail & Related papers (2023-05-31T14:38:14Z)
- ResMem: Learn what you can and memorize the rest [79.19649788662511]
We propose the residual-memorization (ResMem) algorithm to augment an existing prediction model.
By construction, ResMem can explicitly memorize the training labels.
We show that ResMem consistently improves the test set generalization of the original prediction model.
arXiv Detail & Related papers (2023-02-03T07:12:55Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences.
The distilled information from the classifier is then used to train a reliable sentence embedding model.
Our model trained on synthetic data generalizes well and outperforms the existing baselines.
arXiv Detail & Related papers (2022-08-29T05:42:22Z)
- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
- How BPE Affects Memorization in Transformers [36.53583838619203]
We show that the size of the subword vocabulary learned by Byte-Pair Encoding (BPE) greatly affects both the ability and the tendency of standard Transformer models to memorize training data.
We conjecture this effect is caused by reduction in the sequences' length that happens as the BPE vocabulary grows.
arXiv Detail & Related papers (2021-10-06T14:01:56Z)
- Generative Text Modeling through Short Run Inference [47.73892773331617]
The present work proposes a short run dynamics for inference. It is initialized from the prior distribution of the latent variable and then runs a small number of Langevin dynamics steps guided by its posterior distribution.
We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse.
arXiv Detail & Related papers (2021-05-27T09:14:35Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.