PLOD: An Abbreviation Detection Dataset for Scientific Documents
- URL: http://arxiv.org/abs/2204.12061v1
- Date: Tue, 26 Apr 2022 03:52:21 GMT
- Title: PLOD: An Abbreviation Detection Dataset for Scientific Documents
- Authors: Leonardo Zilio, Hadeel Saadany, Prashant Sharma, Diptesh Kanojia,
Constantin Orasan
- Abstract summary: PLOD is a large-scale dataset for abbreviation detection and extraction.
It contains 160k+ segments automatically annotated with abbreviations and their long forms.
We generate several baseline models for detecting abbreviations and long forms.
- Score: 8.085950562565893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The detection and extraction of abbreviations from unstructured texts can
help to improve the performance of Natural Language Processing tasks, such as
machine translation and information retrieval. However, publicly available
datasets do not provide enough data to train deep-neural-network-based models
that generalise well.
This paper presents PLOD, a large-scale dataset for abbreviation detection and
extraction that contains 160k+ segments automatically annotated with
abbreviations and their long forms. We performed manual validation over a set
of instances and a complete automatic validation for this dataset. We then used
it to generate several baseline models for detecting abbreviations and long
forms. The best models achieved an F1-score of 0.92 for abbreviations and 0.89
for detecting their corresponding long forms. We release this dataset along
with our code and all the models publicly at
https://github.com/surrey-nlp/AbbreviationDetRepo.
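A natural way to frame the abbreviation/long-form detection described above is as token classification over text segments, tagging spans that correspond to abbreviations and to their long forms. The sketch below illustrates that setup with Hugging Face Transformers; the BIO-style label set (AC for abbreviations, LF for long forms), the roberta-base backbone, and the example sentence are illustrative assumptions rather than the released configuration, which is available in the repository linked above.

```python
# Minimal sketch: abbreviation/long-form detection as token classification.
# Assumptions (not taken from the paper): the BIO-style label set below,
# the roberta-base backbone, and the example sentence. The released PLOD
# data and baseline models live in the repository linked above.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-AC", "B-LF", "I-LF"]      # AC = abbreviation, LF = long form (assumed tags)
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id
)
model.eval()

text = "Natural language processing (NLP) systems benefit from abbreviation detection."
encoded = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded).logits        # shape: (1, sequence_length, num_labels)

predicted_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
for token, label_id in zip(tokens, predicted_ids):
    print(token, id2label[label_id])        # head is untrained here, so tags are random

# Span-level F1, the metric reported in the abstract, can be computed with seqeval:
# from seqeval.metrics import f1_score
# f1_score([["B-AC", "O", "B-LF", "I-LF"]], [["B-AC", "O", "B-LF", "O"]])
```

Fine-tuning such a model on the annotated segments and scoring predictions span-wise (for example with seqeval) is the usual way per-class F1 scores like the 0.92/0.89 figures quoted above are obtained.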
Related papers
- Distantly Supervised Morpho-Syntactic Model for Relation Extraction [0.27195102129094995]
We present a method for the extraction and categorisation of an unrestricted set of relationships from text.
We evaluate our approach on six datasets built on Wikidata and Wikipedia.
arXiv Detail & Related papers (2024-01-18T14:17:40Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- ASDOT: Any-Shot Data-to-Text Generation with Pretrained Language Models [82.63962107729994]
Any-Shot Data-to-Text (ASDOT) is a new approach flexibly applicable to diverse settings.
It consists of two steps, data disambiguation and sentence fusion.
Experimental results show that ASDOT consistently achieves significant improvement over baselines.
arXiv Detail & Related papers (2022-10-09T19:17:43Z)
- An Ensemble Approach to Acronym Extraction using Transformers [7.88595796865485]
Acronyms are abbreviated units of a phrase, constructed from the initial components of that phrase as it appears in a text (a simple illustrative check of this short-form/long-form relation is sketched after this list).
This paper discusses an ensemble approach for the task of Acronym Extraction.
arXiv Detail & Related papers (2022-01-09T14:49:46Z)
- Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z)
- DCoM: A Deep Column Mapper for Semantic Data Type Detection [0.0]
We introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types.
We train DCoM on 686,765 data columns extracted from VizNet corpus with 78 different semantic data types.
arXiv Detail & Related papers (2021-06-24T10:12:35Z)
- Neural Data-to-Text Generation with LM-based Text Augmentation [27.822282190362856]
We show that a weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% of the annotations.
By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points.
arXiv Detail & Related papers (2021-02-06T10:21:48Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
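As noted in the Acronym Extraction entry above, an acronym is typically built from the initial components of its long form. The snippet below is a didactic rule-of-thumb check of that relation, in the spirit of classic short-form/long-form matching heuristics such as Schwartz & Hearst (2003); it is not the ensemble transformer approach of that paper, and the function name matches_long_form and the examples are illustrative only.

```python
# Didactic heuristic only: does a candidate short form read off the initial
# characters of a candidate long form? Inspired by classic short-form/long-form
# matching (Schwartz & Hearst, 2003); this is NOT the ensemble approach of the
# paper above, and the examples are illustrative.
def matches_long_form(short_form: str, long_form: str) -> bool:
    """Return True if every character of short_form occurs, in order, in
    long_form, with the first character anchored at the start of a word."""
    s, l = short_form.lower(), long_form.lower()
    i, j = len(s) - 1, len(l) - 1
    while i >= 0:
        # Scan the long form backwards for the current short-form character;
        # the first character must additionally sit at a word boundary.
        while j >= 0 and (l[j] != s[i] or (i == 0 and j > 0 and l[j - 1].isalnum())):
            j -= 1
        if j < 0:
            return False
        i -= 1
        j -= 1
    return True

print(matches_long_form("NLP", "natural language processing"))  # True
print(matches_long_form("NLP", "neural networks"))              # False
```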
This list is automatically generated from the titles and abstracts of the papers on this site.