LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction
- URL: http://arxiv.org/abs/2101.11177v1
- Date: Wed, 27 Jan 2021 02:49:26 GMT
- Title: LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction
- Authors: Jacob Solawetz, Stefan Larson
- Abstract summary: We introduce a new dataset by converting the QA-SRL 2.0 dataset to a large-scale Open Information Extraction (OIE) dataset (LSOIE).
Our LSOIE dataset is 20 times larger than the next largest human-annotated OIE dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open Information Extraction (OIE) systems seek to compress the factual
propositions of a sentence into a series of n-ary tuples. These tuples are
useful for downstream tasks in natural language processing like knowledge base
creation, textual entailment, and natural language understanding. However,
current OIE datasets are limited in both size and diversity. We introduce a new
dataset by converting the QA-SRL 2.0 dataset to a large-scale OIE dataset
(LSOIE). Our LSOIE dataset is 20 times larger than the next largest
human-annotated OIE dataset. We construct and evaluate several benchmark OIE
models on LSOIE, providing baselines for future improvements on the task. Our
LSOIE data, models, and code are made publicly available.
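As an illustration of the n-ary tuple format the abstract describes, an OIE system maps a sentence's factual propositions to predicate-argument tuples. The sketch below is a hand-written example of that idea only; the sentence, tuple, and helper function are illustrative and do not reflect the actual LSOIE annotation schema:

```python
# Illustrative only: the kind of n-ary tuple an OIE system produces,
# not the actual LSOIE annotation format.
sentence = "Marie Curie won the Nobel Prize in 1911."

# One extraction: (arg0, predicate, arg1, ..., argN)
extraction = ("Marie Curie", "won", "the Nobel Prize", "in 1911")

def to_proposition(tup):
    """Render an n-ary tuple back into a flat proposition string."""
    return " ".join(tup)

print(to_proposition(extraction))
```

Downstream tasks such as knowledge base creation consume tuples like this directly, which is why OIE benchmarks evaluate systems on how faithfully the tuples cover the source sentence.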
Related papers
- ADELIE: Aligning Large Language Models on Information Extraction [55.60192044049083]
Large language models (LLMs) usually fall short on information extraction tasks.
In this paper, we introduce ADELIE, an aligned LLM that effectively solves various IE tasks.
We show that our models achieve state-of-the-art (SoTA) performance among open-source models.
arXiv Detail & Related papers (2024-05-08T12:24:52Z)
- Leveraging Linguistically Enhanced Embeddings for Open Information Extraction [0.0]
Open Information Extraction (OIE) is a structured prediction task in Natural Language Processing (NLP).
We are the first to leverage linguistic features with a Seq2Seq PLM for OIE.
Our work can give any neural OIE architecture the key performance boost from both PLMs and linguistic features in one go.
arXiv Detail & Related papers (2024-03-20T18:18:48Z)
- IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus [38.27122981449957]
IEPile is a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens.
We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus.
Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization.
arXiv Detail & Related papers (2024-02-22T17:11:38Z) - Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z) - Instruct and Extract: Instruction Tuning for On-Demand Information
Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z) - LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs).
This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z) - InstructIE: A Bilingual Instruction-based Information Extraction Dataset [44.65162892808696]
Large language models can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE).
Recent works indicate that the main reason lies in the lack of extensive data on IE instructions.
We introduce InstructIE, a bilingual instruction-based IE dataset, which covers 12 diverse domains.
arXiv Detail & Related papers (2023-05-19T08:51:11Z) - IELM: An Open Information Extraction Benchmark for Pre-Trained Language
Models [75.48081086368606]
We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LMs).
We create an OIE benchmark aiming to fully examine the open relational information present in the pre-trained LMs.
Surprisingly, pre-trained LMs are able to obtain competitive performance on both standard OIE datasets.
arXiv Detail & Related papers (2022-10-25T16:25:00Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.