Related papers: Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

URL: http://arxiv.org/abs/2404.06209v1
Date: Tue, 9 Apr 2024 10:58:21 GMT
Title: Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
Authors: Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana,
Abstract summary: We introduce a variety of different techniques to assess whether a language model has seen a dataset during training. We compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting.
Score: 21.10890310571397
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker

Related papers

Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z)
Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models [15.603556124006479]
We propose retrieval-augmented language models for scalable TabICL. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets.
arXiv Detail & Related papers (2025-02-05T13:16:41Z)
Transfer Learning of Tabular Data by Finetuning Large Language Models [0.0]
This paper investigates the effectiveness of an application programming interface (API) and transfer learning of large language models (LLM) LLM APIs respond to input text prompts with tokenized data and instructions, whereas transfer learning finetunes an LLM for a target classification task. This paper proposes an end-to-end finetuning of LLM to demonstrate cross-data transfer learning on ten benchmark data sets.
arXiv Detail & Related papers (2025-01-12T16:23:18Z)
Towards Robust Evaluation of Unlearning in LLMs via Data Transformations [17.927224387698903]
Large Language Models (LLMs) have shown to be a great success in a wide range of applications ranging from regular NLP-based use cases to AI agents. In recent times research in the area of Machine Unlearning (MUL) has become active. Main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering from performance loss on regular tasks.
arXiv Detail & Related papers (2024-11-23T07:20:36Z)
Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization. We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge [55.65162959527848]
Large language models have shown excellent performance on many knowledge-intensive tasks. However, pretraining data tends to contain misleading and even conflicting information. This study systematically analyze LLMs' learning preferences for data with conflicting knowledge.
arXiv Detail & Related papers (2024-10-07T06:49:41Z)
Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience [4.814313782484443]
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks. We use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks.
arXiv Detail & Related papers (2024-08-16T06:37:59Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Large Language Models Memorize Sensor Datasets! Implications on Human Activity Recognition Research [0.23982628363233693]
We investigate whether Large Language Models (LLMs) have had access to standard Human Activity Recognition (HAR) datasets during training. Most contemporary LLMs are trained on virtually the entire (accessible) internet -- potentially including standard HAR datasets. For the Daphnet dataset in particular, GPT-4 is able to reproduce blocks of sensor readings.
arXiv Detail & Related papers (2024-06-09T19:38:27Z)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
Elephants Never Forget: Testing Language Models for Memorization of Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over. We introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z)
Task Contamination: Language Models May Not Be Few-Shot Anymore [9.696290050028237]
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time.
arXiv Detail & Related papers (2023-12-26T21:17:46Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed. We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.