InstructIE: A Bilingual Instruction-based Information Extraction Dataset
- URL: http://arxiv.org/abs/2305.11527v4
- Date: Mon, 29 Jul 2024 03:41:34 GMT
- Title: InstructIE: A Bilingual Instruction-based Information Extraction Dataset
- Authors: Honghao Gui, Shuofei Qiao, Jintian Zhang, Hongbin Ye, Mengshu Sun, Lei Liang, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
- Abstract summary: Large language models can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE).
Recent works indicate that the main reason lies in the lack of extensive data on IE instructions.
We introduce InstructIE, a bilingual instruction-based IE dataset, which covers 12 diverse domains.
- Score: 44.65162892808696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE). Recent works indicate that the main reason lies in the lack of extensive data on IE instructions. Moreover, existing datasets of IE instructions not only have limited coverage but also involve high construction costs. To address this issue, we introduce InstructIE, a bilingual instruction-based IE dataset covering 12 diverse domains. We propose KG2Instruction, a framework specifically for the automatic generation of such datasets. Additionally, we manually annotate the test set. Experimental results demonstrate that large language models trained with InstructIE not only obtain better IE capabilities but also show enhanced zero-shot performance compared with baselines.
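For concreteness, the sketch below shows one plausible shape for an instruction-based IE training instance: an instruction, a candidate relation schema, the input text, and the expected triples, serialized as one JSON line. The field names and wording are illustrative assumptions, not the actual InstructIE/KG2Instruction format.

```python
# A minimal sketch of one instruction-based IE training instance.
# Field names here are illustrative assumptions, not the real schema.
import json

example = {
    "instruction": (
        "You are an information extraction assistant. From the text below, "
        "extract all (head, relation, tail) triples whose relation appears "
        "in the given schema."
    ),
    "schema": ["place of birth", "occupation"],  # candidate relations for this domain
    "text": "Ada Lovelace, born in London, was a mathematician.",
    "output": [
        {"head": "Ada Lovelace", "relation": "place of birth", "tail": "London"},
        {"head": "Ada Lovelace", "relation": "occupation", "tail": "mathematician"},
    ],
}

# Instruction-tuning corpora are commonly stored one JSON object per line.
print(json.dumps(example, ensure_ascii=False))
```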
Related papers
- ADELIE: Aligning Large Language Models on Information Extraction [55.60192044049083]
Large language models (LLMs) usually fall short on information extraction tasks.
In this paper, we introduce ADELIE, an aligned LLM that effectively solves various IE tasks.
We show that our models achieve state-of-the-art (SoTA) performance among open-source models.
arXiv Detail & Related papers (2024-05-08T12:24:52Z)
- IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus [38.27122981449957]
IEPile is a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens.
We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus.
Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization (a rough sketch of schema-based instruction generation appears after this list).
arXiv Detail & Related papers (2024-02-22T17:11:38Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, which includes both automatically generated training data and a human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- WebIE: Faithful and Robust Information Extraction on the Web [7.361265860494963]
We present WebIE, the first large-scale, entity-linked closed IE dataset consisting of 1.6M sentences.
WebIE includes negative examples, i.e., sentences without fact triples, to better reflect the data on the web.
We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find models trained on WebIE show better generalisability.
arXiv Detail & Related papers (2023-05-23T17:37:53Z)
- Easy-to-Hard Learning for Information Extraction [57.827955646831526]
Information extraction systems aim to automatically extract structured information from unstructured texts.
We propose a unified easy-to-hard learning framework consisting of three stages, i.e., the easy stage, the hard stage, and the main stage.
By breaking down the learning process into multiple stages, our framework facilitates the model to acquire general IE task knowledge and improve its generalization ability.
arXiv Detail & Related papers (2023-05-16T06:04:14Z)
- CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors [92.17328076003628]
Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks.
In this paper, we propose to recast the structured output in the form of code instead of natural language (see the code-style output sketch after this list).
arXiv Detail & Related papers (2023-05-09T18:40:31Z)
- ChatIE: Zero-Shot Information Extraction via Chatting with ChatGPT [89.49161588240061]
Zero-shot information extraction (IE) aims to build IE systems from unannotated text.
Recent efforts on large language models (LLMs, e.g., GPT-3, ChatGPT) show promising performance in zero-shot settings.
We transform the zero-shot IE task into a multi-turn question-answering problem with a two-stage framework (ChatIE); a minimal sketch of this two-stage loop appears after this list.
arXiv Detail & Related papers (2023-02-20T12:57:12Z)
- LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction [0.9966318185310058]
We introduce a new dataset by converting the QA-SRL 2.0 dataset to a large-scale Open Information Extraction (OIE) dataset (LSOIE).
Our LSOIE dataset is 20 times larger than the next largest human-annotated OIE dataset.
arXiv Detail & Related papers (2021-01-27T02:49:26Z)
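As referenced in the IEPile entry above, the following is a rough sketch of schema-based instruction generation: sampling small slices of a dataset's label schema and rendering each slice into its own extraction instruction. The function name, prompt wording, and sampling policy are assumptions for illustration, not IEPile's actual pipeline.

```python
# Hypothetical sketch of schema-based instruction generation: each prompt
# asks only about a random k-sized subset of the full relation schema.
import random

def generate_instructions(full_schema, text, k=3, n_prompts=2, seed=0):
    """Render n_prompts extraction instructions, each restricted to a
    random k-sized slice of the full label schema."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(n_prompts):
        subset = rng.sample(full_schema, k)
        prompts.append(
            f"Extract all triples whose relation is one of {subset} from: "
            f"{text!r}. Return an empty list for relations that do not occur."
        )
    return prompts

schema = ["founded by", "located in", "spouse", "employer", "capital of"]
for p in generate_instructions(schema, "OpenAI is headquartered in San Francisco."):
    print(p)
```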
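As referenced in the CodeIE entry above, the sketch below shows the general idea of recasting IE output as code: the prompt is an unfinished Python function whose docstring states the task, and the model's completion is itself structured, parseable code. The prompt wording here is an illustrative reconstruction, not the exact format used in the paper.

```python
# Illustrative reconstruction of a code-style IE prompt and completion.
# Prompt: an unfinished function; the LLM fills in the body, so the
# "answer" arrives as directly parseable append calls.
prompt = '''def named_entity_recognition(text):
    """ extract named entities from the input text """
    text = "Steve Jobs co-founded Apple in Cupertino."
    entities = []
'''
completion = '''    entities.append({"type": "person", "name": "Steve Jobs"})
    entities.append({"type": "organization", "name": "Apple"})
    entities.append({"type": "location", "name": "Cupertino"})
'''
print(prompt + completion)
```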
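As referenced in the ChatIE entry above, this is a minimal sketch of a two-stage, multi-turn question-answering loop for zero-shot IE. The `chat` function is a toy stand-in for a chat-LLM call (its signature and the canned answers are assumptions), so the control flow can be run end to end.

```python
def chat(history, question):
    """Toy stand-in for a chat-LLM call; returns canned answers so the
    control flow below can actually run."""
    if "Which of these relations" in question:
        return ["place of birth"]
    return [("Ada Lovelace", "London")]

def chatie_extract(text, relation_types):
    history = [f"The given sentence is: {text}"]
    # Stage 1: one turn to detect which relation types occur at all.
    present = chat(history, f"Which of these relations appear? {relation_types}")
    triples = []
    # Stage 2: one follow-up turn per detected relation to recover its arguments.
    for rel in present:
        history.append(f"(asked about relation: {rel})")
        pairs = chat(history, f"List all (head, tail) pairs for relation '{rel}'.")
        triples += [(head, rel, tail) for head, tail in pairs]
    return triples

print(chatie_extract("Ada Lovelace was born in London.", ["place of birth", "spouse"]))
```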