Do sequence-to-sequence VAEs learn global features of sentences?
- URL: http://arxiv.org/abs/2004.07683v2
- Date: Sun, 28 Mar 2021 18:59:31 GMT
- Title: Do sequence-to-sequence VAEs learn global features of sentences?
- Authors: Tom Bosc and Pascal Vincent
- Abstract summary: We study the Variational Autoencoder (VAE) for natural language with the sequence-to-sequence architecture.
We find that VAEs are prone to memorizing the first words and the sentence length, producing local features of limited usefulness.
Alternative architectures based on bag-of-words assumptions and language model pretraining learn latent variables that are more global, i.e., more predictive of topic or sentiment labels.
- Score: 13.43800646539014
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autoregressive language models are powerful and relatively easy to train.
However, these models are usually trained without explicit conditioning labels
and do not offer easy ways to control global aspects such as sentiment or topic
during generation. Bowman et al. (2016) adapted the Variational Autoencoder
(VAE) for natural language with the sequence-to-sequence architecture and
claimed that the latent vector was able to capture such global features in an
unsupervised manner. We question this claim. We measure which words benefit
most from the latent information by decomposing the reconstruction loss per
position in the sentence. Using this method, we find that VAEs are prone to
memorizing the first words and the sentence length, producing local features of
limited usefulness. To alleviate this, we investigate alternative architectures
based on bag-of-words assumptions and language model pretraining. These
variants learn latent variables that are more global, i.e., more predictive of
topic or sentiment labels. Moreover, using reconstructions, we observe that
they decrease memorization: the first word and the sentence length are not
recovered as accurately as with the baselines, consequently yielding more
diverse reconstructions.
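To make the per-position analysis concrete, here is a minimal sketch of decomposing the teacher-forced reconstruction loss by sentence position, comparing decoding with the inferred latent against a latent drawn from the prior; positions where the gap is large are the ones that benefit from the latent. The toy GRU-based VAE, its layer sizes, and all names below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: decompose the reconstruction loss per position, contrasting decoding
# with the inferred latent vs. a latent sampled from the prior.
# The tiny GRU VAE below is a hypothetical stand-in, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, LAT = 100, 32, 64, 16

class ToySeqVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.enc = nn.GRU(EMB, HID, batch_first=True)
        self.to_mu = nn.Linear(HID, LAT)
        self.to_logvar = nn.Linear(HID, LAT)
        self.z_to_h = nn.Linear(LAT, HID)
        self.dec = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def encode(self, x):
        _, h = self.enc(self.emb(x))                  # h: (1, batch, HID)
        return self.to_mu(h[-1]), self.to_logvar(h[-1])

    def decode_logits(self, x_in, z):
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)  # latent sets the decoder state
        out, _ = self.dec(self.emb(x_in), h0)
        return self.out(out)                          # (batch, len, vocab)

def per_position_nll(model, x, z):
    """Teacher-forced NLL at each target position, averaged over the batch."""
    logits = model.decode_logits(x[:, :-1], z)        # predict tokens 1..T-1
    nll = F.cross_entropy(logits.transpose(1, 2), x[:, 1:], reduction="none")
    return nll.mean(dim=0)                            # one value per sentence position

model = ToySeqVAE()
x = torch.randint(1, VOCAB, (8, 12))                  # toy batch of token ids
mu, logvar = model.encode(x)
z_post = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # latent inferred from x
z_prior = torch.randn_like(z_post)                         # latent sampled from the prior

gain = per_position_nll(model, x, z_prior) - per_position_nll(model, x, z_post)
print("per-position benefit of the latent:", gain.tolist())
```

Averaging this per-position gap over a corpus is the kind of measurement that reveals whether the latent mostly helps with the first words and the sentence length rather than with global content.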
Related papers
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
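As a concrete illustration of testing for verbatim memorization, a small probe can prompt a causal language model with the prefix of a candidate sequence and check whether greedy decoding reproduces the remainder exactly. The checkpoint name, prefix length, and helper function below are assumptions for illustration, not the paper's experimental protocol.

```python
# Sketch of a verbatim-memorization probe: prompt a causal LM with the prefix of a
# candidate sequence and check whether greedy decoding reproduces the rest verbatim.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-160m"               # placeholder checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def is_memorized_verbatim(sequence: str, prefix_tokens: int = 16) -> bool:
    """Prompt with the first `prefix_tokens` tokens and test for an exact continuation."""
    ids = tok(sequence, return_tensors="pt").input_ids[0]
    prefix, target = ids[:prefix_tokens], ids[prefix_tokens:]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=len(target),
                             do_sample=False)  # greedy decoding
    return torch.equal(out[0, prefix_tokens:], target)

print(is_memorized_verbatim("The quick brown fox jumps over the lazy dog. " * 4))
```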
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models [86.75729087623259]
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization.
arXiv Detail & Related papers (2023-10-29T17:27:18Z)
- How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases [28.58785395946639]
We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning.
We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
arXiv Detail & Related papers (2023-05-31T14:38:14Z)
- ResMem: Learn what you can and memorize the rest [79.19649788662511]
We propose the residual-memorization (ResMem) algorithm to augment an existing prediction model.
By construction, ResMem can explicitly memorize the training labels.
We show that ResMem consistently improves the test set generalization of the original prediction model.
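One plausible reading of the residual-memorization idea, sketched below under stated assumptions: fit a base model on what it can learn, then memorize its training residuals with a nearest-neighbor lookup and add the retrieved residual back at prediction time. The base model, the 1-nearest-neighbor memorizer, and the toy data are assumptions for illustration, not the paper's exact construction.

```python
# Sketch in the ResMem spirit: a base model plus explicit memorization of its
# training residuals via nearest neighbors. Model choices here are assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + 0.3 * np.sin(5 * X[:, 0])        # toy regression labels

base = Ridge().fit(X, y)                                       # learn what you can ...
residuals = y - base.predict(X)
memory = KNeighborsRegressor(n_neighbors=1).fit(X, residuals)  # ... memorize the rest

def resmem_predict(X_new):
    return base.predict(X_new) + memory.predict(X_new)

# On training points the residual lookup is exact, so training labels are recovered.
print(np.abs(resmem_predict(X) - y).max())
```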
arXiv Detail & Related papers (2023-02-03T07:12:55Z)
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences.
The distilled information from the classifier is then used to train a reliable sentence embedding model.
Our model trained on synthetic data generalizes well and outperforms the existing baselines.
arXiv Detail & Related papers (2022-08-29T05:42:22Z)
- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
- How BPE Affects Memorization in Transformers [36.53583838619203]
We show that the size of the subword vocabulary learned by Byte-Pair Encoding (BPE) greatly affects both the ability and the tendency of standard Transformer models to memorize training data.
We conjecture this effect is caused by reduction in the sequences' length that happens as the BPE vocabulary grows.
arXiv Detail & Related papers (2021-10-06T14:01:56Z)
- Generative Text Modeling through Short Run Inference [47.73892773331617]
The present work proposes a short run dynamics for inference: it is initialized from the prior distribution of the latent variable and then runs a small number of Langevin dynamics steps guided by its posterior distribution.
We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse.
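A minimal sketch of such short-run inference, assuming a generic differentiable decoder log-likelihood: initialize the latent from the standard normal prior and take a few noisy gradient (Langevin) steps on the unnormalized posterior log p(x|z) + log p(z). The quadratic toy likelihood and the step-size convention below are illustrative assumptions, not the paper's model.

```python
# Sketch of short-run Langevin inference for a latent-variable model.
import torch

def short_run_langevin(log_px_given_z, x, latent_dim, steps=20, step_size=0.1):
    """A few Langevin steps on log p(x|z) + log p(z), starting from the N(0, I) prior."""
    z = torch.randn(x.shape[0], latent_dim, requires_grad=True)
    for _ in range(steps):
        log_post = log_px_given_z(x, z) - 0.5 * (z ** 2).sum(dim=1)  # + log N(z; 0, I) up to a constant
        grad = torch.autograd.grad(log_post.sum(), z)[0]
        z = (z + 0.5 * step_size ** 2 * grad
             + step_size * torch.randn_like(z)).detach().requires_grad_(True)
    return z.detach()

# Toy Gaussian "decoder" likelihood, just to exercise the update rule.
W = torch.randn(5, 8)               # hypothetical map from an 8-d latent to 5-d observations
def toy_loglik(x, z):
    return -0.5 * ((x - z @ W.T) ** 2).sum(dim=1)

x_obs = torch.randn(4, 5)
z_hat = short_run_langevin(toy_loglik, x_obs, latent_dim=8)
print(z_hat.shape)                  # torch.Size([4, 8])
```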
arXiv Detail & Related papers (2021-05-27T09:14:35Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)