From Dataset Recycling to Multi-Property Extraction and Beyond
- URL: http://arxiv.org/abs/2011.03228v1
- Date: Fri, 6 Nov 2020 08:22:12 GMT
- Title: From Dataset Recycling to Multi-Property Extraction and Beyond
- Authors: Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Jakub Chłędowski, Filip Graliński
- Abstract summary: This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset.
The proposed dual-source model outperforms the current state-of-the-art by a large margin.
We introduce WikiReading Recycled, a newly developed public dataset, and the task of multiple property extraction.
- Score: 7.670897251425096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates various Transformer architectures on the WikiReading
Information Extraction and Machine Reading Comprehension dataset. The proposed
dual-source model outperforms the current state-of-the-art by a large margin.
Next, we introduce WikiReading Recycled, a newly developed public dataset and
the task of multiple property extraction. It uses the same data as WikiReading
but does not inherit its predecessor's identified disadvantages. In addition,
we provide a human-annotated test set with diagnostic subsets for a detailed
analysis of model performance.
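As a rough illustration of the dual-source idea, a decoder query can attend separately over two encoder memories (for instance, the article text and the queried property names) and merge the resulting context vectors. The sketch below is a hypothetical pure-Python rendering with toy 2-d vectors; it is not the authors' actual architecture, and the averaging-based merge is an assumption for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, memory):
    """Dot-product attention: weight memory vectors by similarity to the query."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in memory]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]

def dual_source_context(query, source_a, source_b):
    """Attend over two separate sources and combine the contexts (here: averaging)."""
    ctx_a = attend(query, source_a)
    ctx_b = attend(query, source_b)
    return [(a + b) / 2 for a, b in zip(ctx_a, ctx_b)]

# Toy example: one "memory" per source, 2-d vectors.
text_memory = [[1.0, 0.0], [0.0, 1.0]]
property_memory = [[1.0, 1.0]]
ctx = dual_source_context([1.0, 0.0], text_memory, property_memory)
```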
Related papers
- A Computational Analysis of Vagueness in Revisions of Instructional
Texts [2.2577978123177536]
We extract pairwise versions of an instruction before and after a revision was made.
We investigate the ability of a neural model to distinguish between two versions of an instruction in our data.
arXiv Detail & Related papers (2023-09-21T14:26:04Z)
- infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- A Multi-Format Transfer Learning Model for Event Argument Extraction via
Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
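The variational information bottleneck objective has a standard closed form when the latent posterior is a diagonal Gaussian and the prior is a standard normal: a beta-weighted KL term is added to the task loss. A minimal sketch of that regularizer follows; the function names and the beta value are illustrative assumptions, not details taken from the paper:

```python
import math

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)): the bottleneck compression term."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def vib_loss(task_loss, mu, sigma, beta=1e-3):
    """Task loss plus beta-weighted KL regularizer on the latent code."""
    return task_loss + beta * gaussian_kl_to_standard_normal(mu, sigma)
```

The KL term vanishes exactly when the posterior equals the prior (mu = 0, sigma = 1), so beta trades off task accuracy against how much information the latent code retains.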
arXiv Detail & Related papers (2022-08-27T13:52:01Z)
- HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow
Articles [8.53502615629675]
We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS).
This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios.
We describe the creation of the dataset and discuss the unique features that distinguish it from other summarization corpora.
arXiv Detail & Related papers (2021-10-07T04:44:32Z)
- View Distillation with Unlabeled Data for Extracting Adverse Drug
Effects from User-Generated Data [21.0706831551535]
We present an algorithm for identifying Adverse Drug Reactions in social media data.
Our model relies on the properties of the problem and the characteristics of contextual word embeddings.
We evaluate our model on the largest publicly available ADR dataset.
arXiv Detail & Related papers (2021-05-24T15:38:08Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
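The "section titles as aspect labels" construction can be sketched as follows. The snippet assumes plain wikitext-style `== Title ==` headings and an invented article fragment; the actual WikiAsp pipeline processes full Wikipedia dumps:

```python
import re

# Hypothetical wikitext fragment for illustration only.
wikitext = """== History ==
Founded in 1901.
== Geography ==
Located on a river.
Surrounded by hills."""

def sections_as_aspects(text):
    """Split article text on '== Title ==' headings; each heading is an aspect label."""
    parts = re.split(r"^==\s*(.+?)\s*==\s*$", text, flags=re.MULTILINE)
    # parts alternates: [preamble, title1, body1, title2, body2, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

pairs = sections_as_aspects(wikitext)
```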
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- SupMMD: A Sentence Importance Model for Extractive Summarization using
Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on the Maximum Mean Discrepancy (MMD) from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
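For reference, the underlying test statistic is the (biased) squared Maximum Mean Discrepancy between two samples under a kernel. A minimal scalar-valued sketch with an RBF kernel is below; SupMMD's supervised kernel learning and summarization-specific machinery are omitted:

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two scalar samples."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

The statistic is near zero when the two samples come from the same distribution and grows as the samples diverge, which is what makes it usable as a sentence-importance signal.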
arXiv Detail & Related papers (2020-10-06T09:26:55Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Concept Extraction Using Pointer-Generator Networks [86.75999352383535]
We propose a generic open-domain OOV-oriented extractive model that is based on distant supervision of a pointer-generator network.
The model has been trained on a large annotated corpus compiled specifically for this task from 250K Wikipedia pages.
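The pointer-generator copy mechanism (See et al., 2017) mixes a generation distribution over the vocabulary with a copy distribution induced by attention over the source, which is what lets such a model emit out-of-vocabulary concepts. A toy sketch follows; the token names and probabilities are invented for illustration, and this paper's exact model may differ:

```python
def pointer_generator_dist(vocab_probs, attention, source_tokens, p_gen):
    """Final distribution: p_gen * P_vocab(w) + (1 - p_gen) * attention mass on w.

    vocab_probs: dict token -> generation probability
    attention:   weights over source positions (sum to 1)
    """
    final = {tok: p_gen * p for tok, p in vocab_probs.items()}
    for weight, tok in zip(attention, source_tokens):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * weight
    return final

dist = pointer_generator_dist(
    vocab_probs={"the": 0.6, "cat": 0.4},
    attention=[0.7, 0.3],
    source_tokens=["tardigrade", "cat"],  # "tardigrade" is OOV for the generator
    p_gen=0.5,
)
```

Note that the OOV source token receives probability mass purely through the copy branch, while in-vocabulary tokens accumulate mass from both branches.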
arXiv Detail & Related papers (2020-08-25T22:28:14Z)
- On the Multi-Property Extraction and Beyond [7.670897251425096]
We investigate the Dual-source Transformer architecture on the WikiReading information extraction and machine reading comprehension dataset.
We introduce WikiReading Recycled - a newly developed public dataset, supporting the task of multiple property extraction.
arXiv Detail & Related papers (2020-06-15T11:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.