Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
- URL: http://arxiv.org/abs/2502.11275v1
- Date: Sun, 16 Feb 2025 21:32:20 GMT
- Title: Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
- Authors: Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
- Abstract summary: We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM's pre-training and post-training data.
- Score: 36.58490792678384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token \emph{prediction} into \emph{extraction} for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, \emph{Cuckoo}, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
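The abstract's key move is a data conversion: whenever the tokens an LLM would predict next already occur in the preceding context, the next-token-prediction target can be recast as BIO-tagged extraction over that context. The following is a minimal sketch of one plausible form of this conversion; the function name, word-level tokenization, and labeling details are illustrative assumptions rather than the paper's actual pipeline.

```python
# Hypothetical illustration of next tokens extraction (NTE): instead of
# generating the continuation, tag where the continuation already appears
# in the context, turning a next-token-prediction sample into a BIO-tagged
# extraction sample. Names and details are assumptions, not the paper's code.

def nte_convert(context_tokens, next_tokens):
    """Return BIO labels over context_tokens marking spans equal to next_tokens."""
    labels = ["O"] * len(context_tokens)
    n = len(next_tokens)
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == next_tokens:
            labels[start] = "B"                            # span start
            labels[start + 1:start + n] = ["I"] * (n - 1)  # span continuation
    return labels

if __name__ == "__main__":
    # "Paris" is the token to be predicted next, and it already occurs in the
    # context, so the sample becomes an extraction instance instead.
    context = "Paris is the capital of France . The capital of France is".split()
    print(list(zip(context, nte_convert(context, ["Paris"]))))
```

Samples whose continuation never appears in the context would be left all-"O" under this toy rule and could simply be filtered out; how the paper actually selects and filters its 102.6M converted instances is not specified in the abstract.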
Related papers
- LLM-based Semantic Augmentation for Harmful Content Detection [5.954202581988127]
This paper introduces an approach that prompts large language models to clean noisy text and provide context-rich explanations.
We evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets.
Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models.
arXiv Detail & Related papers (2025-04-22T02:59:03Z)
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We present a framework that selects high-quality pretraining data without any LLM training of our own.
We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations (a hedged sketch of this selection rule appears after this list).
Our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM.
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
- Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.
We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.
Our findings show that instruction-tuned models can expose pre-training data as much as their base models, if not more so, and that using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
- In-Context Unlearning: Language Models as Few Shot Unlearners [27.962361828354716]
We propose a new class of unlearning methods for Large Language Models (LLMs).
This method unlearns instances from the model by simply providing specific kinds of inputs in context, without the need to update model parameters.
Our experimental results demonstrate that in-context unlearning performs on par with, or in some cases outperforms, other state-of-the-art methods that require access to model parameters.
arXiv Detail & Related papers (2023-10-11T15:19:31Z)
- Pre-training with Synthetic Data Helps Offline Reinforcement Learning [4.531082205797088]
We show that language is not essential for improved performance.
We then consider pre-training Conservative Q-Learning (CQL), a popular offline DRL algorithm.
Surprisingly, pre-training with simple synthetic data for a small number of updates can also improve CQL.
arXiv Detail & Related papers (2023-10-01T19:32:14Z)
- ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation [43.270424225285105]
We focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks.
We propose Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-08-22T02:25:04Z)
- CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors [92.17328076003628]
Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks.
In this paper, we propose to recast the structured output in the form of code instead of natural language (a hedged sketch of this code-style recasting appears after this list).
arXiv Detail & Related papers (2023-05-09T18:40:31Z)
- IELM: An Open Information Extraction Benchmark for Pre-Trained Language Models [75.48081086368606]
We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LMs).
We create an OIE benchmark aiming to fully examine the open relational information present in pre-trained LMs.
Surprisingly, pre-trained LMs are able to obtain competitive performance on standard OIE datasets.
arXiv Detail & Related papers (2022-10-25T16:25:00Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
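For the perplexity-correlations entry above, here is a minimal sketch of what a selection rule "centered around estimates of perplexity-benchmark correlations" could look like: take log-perplexities that existing public models assign to candidate data domains, correlate them with those same models' benchmark scores, and keep the domains where lower perplexity most consistently accompanies higher scores. The variable names, the use of plain Pearson correlation, and the top-k cutoff are assumptions for illustration, not the paper's estimator.

```python
# Hypothetical sketch: pick pretraining domains whose perplexity (measured with
# off-the-shelf models, i.e. no LLM training of our own) correlates most
# negatively with benchmark performance across those models.
import numpy as np

def select_domains(log_ppl, bench_scores, k):
    """log_ppl: (n_models, n_domains) log-perplexities from existing models.
    bench_scores: (n_models,) benchmark scores of the same models.
    Returns indices of the k domains with the most negative correlation."""
    corrs = np.array([np.corrcoef(log_ppl[:, d], bench_scores)[0, 1]
                      for d in range(log_ppl.shape[1])])
    return np.argsort(corrs)[:k]  # most negative correlations first

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ppl = rng.normal(size=(20, 5))                    # 20 models, 5 domains
    scores = -ppl[:, 2] + 0.1 * rng.normal(size=20)   # domain 2 tracks the benchmark
    print(select_domains(ppl, scores, k=2))           # expect domain 2 ranked first
```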
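For the CodeIE entry above, a toy illustration of recasting structured IE output as code instead of natural language: each few-shot example is rendered as a Python skeleton whose entity list a code-pretrained LLM is asked to fill in, and the query is the same skeleton left open. The prompt layout, function name, and entity schema are assumptions for illustration, not CodeIE's exact format.

```python
# Hypothetical code-style rendering of NER samples for a few-shot prompt to a
# code LLM: structured output becomes Python literals instead of free text.

def render_ner_example(text, entities=None):
    """Render one NER example as code; leave the entity list open if unlabeled."""
    lines = [
        "def named_entity_recognition(input_text):",
        f'    """Extract named entities from: {text}"""',
        "    entity_list = []",
    ]
    if entities is None:
        return "\n".join(lines)  # unlabeled query for the model to complete
    for span, etype in entities:
        lines.append(f'    entity_list.append({{"text": "{span}", "type": "{etype}"}})')
    return "\n".join(lines)

if __name__ == "__main__":
    demo = render_ner_example(
        "Barack Obama visited Paris in 2015 .",
        [("Barack Obama", "person"), ("Paris", "location"), ("2015", "date")],
    )
    query = render_ner_example("Steve Jobs founded Apple in California .")
    # Few-shot prompt = worked example(s) + open query; the code LLM is expected
    # to continue with further entity_list.append(...) lines, which parse back
    # into structured predictions.
    print(demo + "\n\n" + query)
```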
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.