IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
- URL: http://arxiv.org/abs/2402.14710v3
- Date: Sun, 26 May 2024 15:54:41 GMT
- Title: IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
- Authors: Honghao Gui, Lin Yuan, Hongbin Ye, Ningyu Zhang, Mengshu Sun, Lei Liang, Huajun Chen
- Abstract summary: IEPile is a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens.
We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus.
Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization.
- Score: 38.27122981449957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
Related papers
- Assessing the Performance of Chinese Open Source Large Language Models in Information Extraction Tasks [12.400599440431188]
Information Extraction (IE) plays a crucial role in Natural Language Processing (NLP).
Recent experiments focusing on English IE tasks have shed light on the challenges faced by Large Language Models (LLMs) in achieving optimal performance.
arXiv Detail & Related papers (2024-06-04T08:00:40Z)
- ADELIE: Aligning Large Language Models on Information Extraction [55.60192044049083]
Large language models (LLMs) usually fall short on information extraction tasks.
In this paper, we introduce ADELIE, an aligned LLM that effectively solves various IE tasks.
We show that our models achieve state-of-the-art (SoTA) performance among open-source models.
arXiv Detail & Related papers (2024-05-08T12:24:52Z)
- INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [59.07490387145391]
Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks.
Their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language.
We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories.
arXiv Detail & Related papers (2024-01-12T12:10:28Z)
- Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Information extraction aims to extract structural knowledge from plain natural language texts.
Generative Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
LLMs offer viable solutions for IE tasks based on a generative paradigm.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
- GIELLM: Japanese General Information Extraction Large Language Model Utilizing Mutual Reinforcement Effect [0.0]
We introduce the General Information Extraction Large Language Model (GIELLM).
It integrates text Classification, Sentiment Analysis, Named Entity Recognition, Relation Extraction, and Event Extraction using a uniform input-output schema.
This innovation marks the first instance of a model simultaneously handling such a diverse array of IE subtasks.
arXiv Detail & Related papers (2023-11-12T13:30:38Z)
- Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction [46.09887436555637]
This paper introduces a fine-grained IE benchmark dataset tailored for Large Language Models (LLMs).
Through extensive evaluations, we observe that encoder-decoder models, particularly T5 and FLAN-T5, perform well in generalizing to unseen information types.
arXiv Detail & Related papers (2023-10-08T09:41:18Z)
- Exploring Large Language Model for Graph Data Understanding in Online Job Recommendations [63.19448893196642]
We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs.
By leveraging this capability, our framework enables personalized and accurate job recommendations for individual users.
arXiv Detail & Related papers (2023-07-10T11:29:41Z)
- PIVOINE: Instruction Tuning for Open-world Information Extraction [53.98073623222221]
We consider the problem of Open-world Information Extraction (Open-world IE), which extracts comprehensive entity profiles from unstructured texts.
We develop a large language model (LLM) that is able to perform Open-world IE to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions.
In particular, we construct INSTRUCTOPENWIKI, a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions.
arXiv Detail & Related papers (2023-05-24T08:52:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.