What Does This Acronym Mean? Introducing a New Dataset for Acronym
Identification and Disambiguation
- URL: http://arxiv.org/abs/2010.14678v1
- Date: Wed, 28 Oct 2020 00:12:36 GMT
- Authors: Amir Pouran Ben Veyseh, Franck Dernoncourt, Quan Hung Tran, Thien Huu
Nguyen
- Abstract summary: Acronyms are the short forms of phrases that facilitate conveying lengthy sentences in documents and serve as one of the mainstays of writing.
Due to their importance, identifying acronyms and their corresponding phrases (i.e., acronym identification (AI)) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding.
Despite the recent progress on this task, there are some limitations in the existing datasets which hinder further improvement.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acronyms are the short forms of phrases that facilitate conveying lengthy
sentences in documents and serve as one of the mainstays of writing. Due to
their importance, identifying acronyms and corresponding phrases (i.e., acronym
identification (AI)) and finding the correct meaning of each acronym (i.e.,
acronym disambiguation (AD)) are crucial for text understanding. Despite
recent progress on these tasks, limitations in the existing datasets hinder
further improvement. More specifically, the limited size of manually annotated
AI datasets and the noise in automatically created ones obstruct the design of
advanced, high-performing acronym identification models. Moreover, the
existing datasets are mostly limited to the medical domain and ignore other
domains. To address these two limitations, we first create a large, manually
annotated AI dataset for the scientific domain. This dataset contains 17,506
sentences, substantially more than previous scientific AI datasets. Next, we
prepare an AD dataset for the scientific domain with 62,441 samples,
significantly larger than the previous scientific AD dataset. Our experiments
show that existing state-of-the-art models fall far behind human-level
performance on both datasets proposed in this work. In addition, we propose a
new deep learning model that exploits the syntactic structure of the sentence
to expand an ambiguous acronym. The proposed model outperforms the
state-of-the-art models on the new AD dataset, providing a strong baseline for
future research on this dataset.
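As a rough illustration of the acronym identification task described above (not the paper's annotation scheme or model), the following sketch uses a simple initial-matching heuristic: find a parenthesized uppercase token and check whether its letters match the initials of the immediately preceding words. The function name and the example sentence are illustrative assumptions.

```python
import re

def identify_acronyms(sentence):
    """Find 'long form (ACRONYM)' pairs where the acronym's letters
    match the initials of the words preceding the parenthesis.
    A toy heuristic, far weaker than a learned sequence labeler."""
    results = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", sentence):
        acronym = match.group(1)
        # Words immediately before the opening parenthesis.
        prefix = sentence[:match.start()].strip().split()
        candidate = prefix[-len(acronym):]
        initials = "".join(w[0].upper() for w in candidate)
        if initials == acronym:
            results.append((" ".join(candidate), acronym))
    return results

print(identify_acronyms("We use a convolutional neural network (CNN) here."))
# → [('convolutional neural network', 'CNN')]
```

Heuristics like this miss acronyms whose long form is absent from the sentence, which is precisely why the paper frames AI as a supervised task over annotated sentences.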
Related papers
- Long-Tailed Anomaly Detection with Learnable Class Names [64.79139468331807]
We introduce several datasets with different levels of class imbalance and metrics for performance evaluation.
We then propose a novel method, LTAD, to detect defects from multiple and long-tailed classes, without relying on dataset class names.
LTAD substantially outperforms the state-of-the-art methods for most forms of dataset imbalance.
arXiv Detail & Related papers (2024-03-29T15:26:44Z)
- ASDOT: Any-Shot Data-to-Text Generation with Pretrained Language Models [82.63962107729994]
Any-Shot Data-to-Text (ASDOT) is a new approach flexibly applicable to diverse settings.
It consists of two steps, data disambiguation and sentence fusion.
Experimental results show that ASDOT consistently achieves significant improvement over baselines.
arXiv Detail & Related papers (2022-10-09T19:17:43Z)
- MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction [66.60031336330547]
Acronyms and their expanded forms are necessary for various NLP applications.
One limitation of existing AE research is that it is restricted to the English language and certain domains.
The lack of annotated datasets in multiple languages and domains has been a major obstacle to research in this area.
arXiv Detail & Related papers (2022-02-19T23:08:38Z)
- CABACE: Injecting Character Sequence Information and Domain Knowledge for Enhanced Acronym and Long-Form Extraction [0.0]
We propose a novel framework CABACE: Character-Aware BERT for ACronym Extraction.
It takes into account character sequences in text and is adapted to scientific and legal domains by masked language modelling.
We show that the proposed framework is better suited than baseline models for zero-shot generalization to non-English languages.
arXiv Detail & Related papers (2021-12-25T14:03:09Z)
- Document-Level Text Simplification: Dataset, Criteria and Baseline [75.58761130635824]
We define and investigate a new task of document-level text simplification.
Based on Wikipedia dumps, we first construct a large-scale dataset named D-Wikipedia.
We propose a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.
arXiv Detail & Related papers (2021-10-11T08:15:31Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Leveraging Domain Agnostic and Specific Knowledge for Acronym Disambiguation [5.766754189548904]
Acronym disambiguation aims to find the correct meaning of an ambiguous acronym in a text.
We propose a Hierarchical Dual-path BERT method coined hdBERT to capture the general fine-grained and high-level specific representations.
Using the widely adopted SciAD dataset, which contains 62,441 sentences, we investigate the effectiveness of hdBERT.
arXiv Detail & Related papers (2021-07-01T09:10:00Z)
- BERT-based Acronym Disambiguation with Multiple Training Strategies [8.82012912690778]
The acronym disambiguation (AD) task aims to find the correct expansion of an ambiguous acronym in a given sentence.
We propose a binary classification model incorporating BERT and several training strategies including dynamic negative sample selection.
Experiments on SciAD show the effectiveness of our proposed model and our score ranks 1st in SDU@AAAI-21 shared task 2: Acronym Disambiguation.
arXiv Detail & Related papers (2021-02-25T05:40:21Z)
- Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding [41.63345823743157]
Acronyms are short forms of longer phrases frequently used in writing.
Every text understanding tool should be capable of recognizing acronyms in text.
To push research in this direction forward, we organized two shared tasks for acronym identification and acronym disambiguation in scientific documents.
arXiv Detail & Related papers (2020-12-22T00:29:15Z)
- Primer AI's Systems for Acronym Identification and Disambiguation [0.0]
We introduce new methods for acronym identification and disambiguation.
Our systems achieve significant performance gains over previously suggested methods.
Both of our systems perform competitively on the SDU@AAAI-21 shared task leaderboard.
arXiv Detail & Related papers (2020-12-14T23:59:05Z)
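Several of the papers above (hdBERT, the BERT-based SDU@AAAI-21 entries) frame acronym disambiguation as choosing among candidate expansions. As a minimal, hypothetical sketch of that framing, the snippet below scores each candidate by token overlap with the sentence context instead of a BERT classifier; the expansion inventory and function names are illustrative, not from any of the cited systems.

```python
def disambiguate(acronym, sentence, expansions):
    """Pick the candidate expansion sharing the most words with the
    sentence context. A toy stand-in for the BERT-based classifiers
    described in the papers above."""
    context = set(sentence.lower().split())
    def score(expansion):
        return len(set(expansion.lower().split()) & context)
    return max(expansions, key=score)

# Hypothetical expansion inventory for the acronym "CNN".
candidates = ["convolutional neural network", "cable news network"]
print(disambiguate(
    "CNN",
    "The CNN was trained with convolutional layers on labeled images",
    candidates,
))
# → convolutional neural network
```

Real AD models replace the overlap score with contextual embeddings, which is what lets them separate expansions that share no surface words with the sentence.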
This list is automatically generated from the titles and abstracts of the papers in this site.