Instruct and Extract: Instruction Tuning for On-Demand Information
Extraction
- URL: http://arxiv.org/abs/2310.16040v1
- Date: Tue, 24 Oct 2023 17:54:25 GMT
- Title: Instruct and Extract: Instruction Tuning for On-Demand Information
Extraction
- Authors: Yizhu Jiao, Ming Zhong, Sha Li, Ruining Zhao, Siru Ouyang, Heng Ji,
Jiawei Han
- Abstract summary: On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
- Score: 86.29491354355356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models with instruction-following capabilities open the door
to a wider group of users. However, when it comes to information extraction - a
classic task in natural language processing - most task-specific systems cannot
align well with long-tail ad hoc extraction use cases for non-expert users. To
address this, we propose a novel paradigm, termed On-Demand Information
Extraction, to fulfill the personalized demands of real-world users. Our task
aims to follow the instructions to extract the desired content from the
associated text and present it in a structured tabular format. The table
headers can either be user-specified or inferred contextually by the model. To
facilitate research in this emerging area, we present a benchmark named
InstructIE, inclusive of both automatically generated training data, as well as
the human-annotated test set. Building on InstructIE, we further develop an
On-Demand Information Extractor, ODIE. Comprehensive evaluations on our
benchmark reveal that ODIE substantially outperforms the existing open-source
models of similar size. Our code and dataset are released on
https://github.com/yzjiao/On-Demand-IE.
Related papers
- Leveraging Large Language Models for Web Scraping [0.0]
This research investigates a general-purpose accurate data scraping recipe for RAG models designed for language generation.
To capture knowledge in a more modular and interpretable way, we use pre trained language models with a latent knowledge retriever.
arXiv Detail & Related papers (2024-06-12T14:15:15Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - Benchmarking Large Language Models with Augmented Instructions for
Fine-grained Information Extraction [46.09887436555637]
This paper introduces a fine-grained IE benchmark dataset tailored for Large Language Models (LLMs)
Through extensive evaluations, we observe that encoder-decoder models, particularly T5 and FLAN-T5, perform well in generalizing to unseen information types.
arXiv Detail & Related papers (2023-10-08T09:41:18Z) - STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z) - PIVOINE: Instruction Tuning for Open-world Information Extraction [53.98073623222221]
We consider the problem of Open-world Information Extraction (Open-world IE), which extracts comprehensive entity profiles from unstructured texts.
We develop a large language model (LLM) that is able to perform Open-world IE to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions.
In particular, we construct INSTRUCTOPENWIKI, a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions.
arXiv Detail & Related papers (2023-05-24T08:52:08Z) - ImPaKT: A Dataset for Open-Schema Knowledge Base Construction [10.073210304061966]
ImPaKT is a dataset for open-schema information extraction consisting of around 2500 text snippets from the C4 corpus, in the shopping domain (product buying guides)
We evaluate the power of this approach by fine-tuning the open source UL2 language model on a subset of the dataset, extracting a set of implication relations from a corpus of product buying guides, and conducting human evaluations of the resulting predictions.
arXiv Detail & Related papers (2022-12-21T05:02:49Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue:
Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.