Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models
- URL: http://arxiv.org/abs/2407.18158v1
- Date: Thu, 25 Jul 2024 16:13:58 GMT
- Title: Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models
- Authors: Sanae Lotfi, Yilun Kuang, Brandon Amos, Micah Goldblum, Marc Finzi, Andrew Gordon Wilson,
- Abstract summary: We derive non-vacuous generalization bounds for large language models as large as LLaMA2-70B.
Our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
- Score: 79.70436109672599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually benefit from far less restrictive compression schemes. With Monarch matrices, Kronecker factorizations, and post-training quantization, we achieve non-vacuous generalization bounds for LLMs as large as LLaMA2-70B. Unlike previous approaches, our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
Related papers
- Demystifying Singular Defects in Large Language Models [61.98878352956125]
In large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored.
We provide both theoretical insights and empirical validation across a range of recent models.
We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures.
arXiv Detail & Related papers (2025-02-10T20:09:16Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs)
We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.
PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation [15.904856111636851]
This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models on very small datasets.
We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples, the long-sequence generative capabilities are greatly enhanced.
arXiv Detail & Related papers (2024-12-05T16:34:20Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact [46.32830393597601]
Large language models (LLMs) excel in natural language processing but demand intensive computation.
This paper unveils a previously overlooked type of outliers in LLMs.
We propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model.
arXiv Detail & Related papers (2024-03-02T16:05:26Z) - Non-Vacuous Generalization Bounds for Large Language Models [78.42762571499061]
We provide the first non-vacuous generalization bounds for pretrained large language models.
We show that larger models have better generalization bounds and are more compressible than smaller models.
arXiv Detail & Related papers (2023-12-28T17:58:42Z) - TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models [69.49978333446538]
TEAL is an approach to treat the input from any modality as a token sequence.
It embeds the token sequence into a joint embedding space with a learnable embedding matrix.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding.
arXiv Detail & Related papers (2023-11-08T10:34:16Z) - Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM [31.25193238045053]
We introduce a novel method, namely GenCo, which leverages the strong generative power of large language models to assist in training a smaller language model.
In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways.
It helps crafting additional high-quality training pairs, by rewriting input texts conditioned on predicted labels.
arXiv Detail & Related papers (2023-04-24T07:35:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.