Bridging Information-Theoretic and Geometric Compression in Language
Models
- URL: http://arxiv.org/abs/2310.13620v2
- Date: Thu, 9 Nov 2023 14:03:46 GMT
- Title: Bridging Information-Theoretic and Geometric Compression in Language
Models
- Authors: Emily Cheng, Corentin Kervadec, and Marco Baroni
- Abstract summary: For a language model to faithfully model human language, it must compress vast, potentially infinite information into relatively few dimensions.
We show that, in turn, high compression of a linguistic dataset predicts rapid adaptation to that dataset.
As a practical byproduct of our analysis, we evaluate a battery of intrinsic dimension estimators for the first time on linguistic data.
- Score: 11.96710733444808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For a language model (LM) to faithfully model human language, it must
compress vast, potentially infinite information into relatively few dimensions.
We propose analyzing compression in (pre-trained) LMs from two points of view:
geometric and information-theoretic. We demonstrate that the two views are
highly correlated, such that the intrinsic geometric dimension of linguistic
data predicts their coding length under the LM. We then show that, in turn,
high compression of a linguistic dataset predicts rapid adaptation to that
dataset, confirming that being able to compress linguistic information is an
important part of successful LM performance. As a practical byproduct of our
analysis, we evaluate a battery of intrinsic dimension estimators for the first
time on linguistic data, showing that only some encapsulate the relationship
between information-theoretic compression, geometric compression, and
ease-of-adaptation.
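To make the geometric view concrete, below is a minimal sketch of one widely used nearest-neighbor intrinsic dimension (ID) estimator, TwoNN (Facco et al., 2017). Treating TwoNN as representative of the battery of estimators evaluated in the paper is an assumption, and the synthetic matrix stands in for real LM representations.

```python
# Minimal sketch of the TwoNN intrinsic dimension estimator; whether it is
# among the estimators the paper evaluates is an assumption.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate: the ID is inferred from the
    distribution of ratios r2/r1 of each point's two nearest neighbors."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]   # r2 / r1 (column 0 is the point itself)
    mu = mu[mu > 1.0]                # drop ties so log(mu) stays positive
    return len(mu) / np.log(mu).sum()

# Sanity check on synthetic data: a 2-D plane embedded in 768 dimensions,
# mimicking the shape of transformer hidden states.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 768))
print(twonn_id(X))  # ~2.0
```

Under the paper's finding, a dataset whose representations yield a low ID estimate of this kind should also have a short coding length under the LM and support rapid adaptation.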
Related papers
- Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion [36.47982325967706]
From a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state.
We find that this synchronicity is essentially "Compression Hacking" in LM representations.
We propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline.
arXiv Detail & Related papers (2025-05-23T12:11:03Z)
- Decomposition of surprisal: Unified computational model of ERP components in language processing [7.760815504640362]
We advance an information-theoretic model of human language processing in the brain in which incoming linguistic input is processed at first shallowly and later with more depth.
We show that the information content (surprisal) of a word in context can be decomposed into two quantities: (A) shallow surprisal, which signals shallow processing difficulty for a word and corresponds to the N400 signal; and (B) deep surprisal, which reflects the discrepancy between shallow and deep representations and corresponds to the P600 signal (a schematic form of this decomposition follows the entry).
arXiv Detail & Related papers (2024-09-10T18:14:02Z)
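A schematic rendering of the decomposition described in the surprisal entry above. The shallow distribution p_sh is our own placeholder notation (the paper's exact definitions may differ); with deep surprisal written as a log-ratio, the identity holds for any choice of p_sh:

```latex
\[
  \underbrace{-\log p(w_t \mid w_{<t})}_{\text{total surprisal}}
  \;=\;
  \underbrace{-\log p_{\mathrm{sh}}(w_t \mid w_{<t})}_{\text{shallow surprisal (N400)}}
  \;+\;
  \underbrace{\log \frac{p_{\mathrm{sh}}(w_t \mid w_{<t})}{p(w_t \mid w_{<t})}}_{\text{deep surprisal (P600)}}
\]
```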
- Compression Represents Intelligence Linearly [14.651664954289354]
Language modeling with large language models (LLMs) has been shown to be equivalent to compression.
Despite such appealing discussions, little empirical evidence exists on the interplay between compression and intelligence.
Our study brings together 31 public LLMs from diverse organizations and evaluates them across 12 benchmarks (a measurement sketch follows this entry).
Remarkably, we find that LLMs' intelligence almost linearly correlates with their ability to compress external text corpora.
arXiv Detail & Related papers (2024-04-15T17:03:41Z)
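A minimal sketch of the kind of compression measurement the entry above correlates with intelligence: bits per character of a corpus under an LM's next-token distribution. The gpt2 checkpoint and the toy string are placeholders, not the paper's models or corpora.

```python
# Hedged sketch: an LM's compression of text measured as bits per character.
# Model and text are placeholder assumptions, not the paper's setup.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_char(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy, in nats
    total_bits = loss.item() * (ids.shape[1] - 1) / math.log(2)
    return total_bits / len(text)

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(bits_per_char(lm, tok, "Better language models compress text into fewer bits."))
```

Lower values mean better compression; the reported finding is that benchmark scores track this quantity almost linearly across models.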
- TexShape: Information Theoretic Sentence Embedding for Language Models [5.265661844206274]
This paper addresses the challenge of encoding sentences into optimized representations through the lens of information theory.
We use empirical estimates of mutual information based on the Donsker-Varadhan variational representation of the Kullback-Leibler divergence (given after this entry).
Our experiments demonstrate significant advancements in preserving maximal targeted information and minimal sensitive information under adverse compression ratios.
arXiv Detail & Related papers (2024-02-05T22:48:28Z)
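For reference, the Donsker-Varadhan variational form of the Kullback-Leibler divergence that the TexShape entry's mutual-information estimates build on; the supremum ranges over functions T, realized in practice by a neural critic:

```latex
\[
  D_{\mathrm{KL}}(P \,\|\, Q)
  \;=\;
  \sup_{T}\;\Bigl( \mathbb{E}_{P}[T] \;-\; \log \mathbb{E}_{Q}\bigl[e^{T}\bigr] \Bigr)
\]
```

Mutual information is then the KL divergence between the joint distribution and the product of marginals, so maximizing the right-hand side over a critic network yields a trainable lower bound.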
- Evaluating Neural Language Models as Cognitive Models of Language Acquisition [4.779196219827507]
We argue that some of the most prominent benchmarks for evaluating the syntactic capacities of neural language models may not be sufficiently rigorous.
When trained on small-scale data modeling child language acquisition, the LMs can be readily matched by simple baseline models.
We conclude with suggestions for better connecting LMs with the empirical study of child language acquisition.
arXiv Detail & Related papers (2023-10-31T00:16:17Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
arXiv Detail & Related papers (2022-12-19T21:34:00Z)
- What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that under compression, performance on under-represented languages drops significantly, while the average BLEU metric decreases only slightly.
We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity across datasets additionally allows us to discuss the efficacy of linguistically-backed word preprocessing (a sketch of the FD measure follows this entry).
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
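A minimal sketch of Feature Density under one common formulation, unique features divided by total feature occurrences; whether this matches the paper's exact definition is an assumption, and the whitespace tokenizer stands in for its linguistically-backed preprocessing.

```python
# Hedged sketch of Feature Density (FD) as a dataset-complexity proxy:
# unique features divided by total feature occurrences. The whitespace
# tokenizer is an assumption; the paper uses linguistically-backed features.
from collections import Counter

def feature_density(docs: list[str]) -> float:
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

print(feature_density(["the cat sat on the mat", "the dog sat down"]))  # 0.7
```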
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.