From Dataset Recycling to Multi-Property Extraction and Beyond
- URL: http://arxiv.org/abs/2011.03228v1
- Date: Fri, 6 Nov 2020 08:22:12 GMT
- Title: From Dataset Recycling to Multi-Property Extraction and Beyond
- Authors: Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Jakub
Chłędowski, Filip Graliński
- Abstract summary: This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset.
The proposed dual-source model outperforms the current state-of-the-art by a large margin.
We introduce WikiReading Recycled, a newly developed public dataset, and the task of multiple property extraction.
- Score: 7.670897251425096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates various Transformer architectures on the WikiReading
Information Extraction and Machine Reading Comprehension dataset. The proposed
dual-source model outperforms the current state-of-the-art by a large margin.
Next, we introduce WikiReading Recycled, a newly developed public dataset, and
the task of multiple property extraction. It uses the same data as WikiReading
but does not inherit its predecessor's identified disadvantages. In addition,
we provide a human-annotated test set with diagnostic subsets for a detailed
analysis of model performance.
Related papers
- Anno-incomplete Multi-dataset Detection [67.69438032767613]
We propose a novel problem, "Anno-incomplete Multi-dataset Detection".
We develop an end-to-end multi-task learning architecture which can accurately detect all the object categories with multiple partially annotated datasets.
arXiv Detail & Related papers (2024-08-29T03:58:21Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - A Computational Analysis of Vagueness in Revisions of Instructional
Texts [2.2577978123177536]
We extract pairwise versions of an instruction before and after a revision was made.
We investigate the ability of a neural model to distinguish between two versions of an instruction in our data.
arXiv Detail & Related papers (2023-09-21T14:26:04Z) - infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - A Multi-Format Transfer Learning Model for Event Argument Extraction via
Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
arXiv Detail & Related papers (2022-08-27T13:52:01Z) - HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow
Articles [8.53502615629675]
We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS).
This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios.
We describe the creation of the dataset and discuss the unique features that distinguish it from other summarization corpora.
arXiv Detail & Related papers (2021-10-07T04:44:32Z) - View Distillation with Unlabeled Data for Extracting Adverse Drug
Effects from User-Generated Data [21.0706831551535]
We present an algorithm for identifying Adverse Drug Reactions in social media data.
Our model relies on the properties of the problem and the characteristics of contextual word embeddings.
We evaluate our model on the largest publicly available ADR dataset.
arXiv Detail & Related papers (2021-05-24T15:38:08Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - On the Multi-Property Extraction and Beyond [7.670897251425096]
We investigate the Dual-source Transformer architecture on the WikiReading information extraction and machine reading comprehension dataset.
We introduce WikiReading Recycled - a newly developed public dataset, supporting the task of multiple property extraction.
arXiv Detail & Related papers (2020-06-15T11:07:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.