CABACE: Injecting Character Sequence Information and Domain Knowledge
for Enhanced Acronym and Long-Form Extraction
- URL: http://arxiv.org/abs/2112.13237v1
- Date: Sat, 25 Dec 2021 14:03:09 GMT
- Title: CABACE: Injecting Character Sequence Information and Domain Knowledge
for Enhanced Acronym and Long-Form Extraction
- Authors: Nithish Kannen, Divyanshu Sheth, Abhranil Chandra, Shubhraneel Pal
- Abstract summary: We propose a novel framework CABACE: Character-Aware BERT for ACronym Extraction.
It takes into account character sequences in text and is adapted to scientific and legal domains by masked language modelling.
We show that the proposed framework is better suited than baseline models for zero-shot generalization to non-English languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Acronyms and long-forms are commonly found in research documents, more so in
documents from scientific and legal domains. Many acronyms used in such
documents are domain-specific and are very rarely found in normal text corpora.
Owing to this, transformer-based NLP models often treat acronym tokens as OOV
(out-of-vocabulary), especially in non-English languages, and their performance
suffers when linking acronyms to their long forms during
extraction. Moreover, pretrained transformer models like BERT are not
specialized to handle scientific and legal documents. With these points being
the overarching motivation behind this work, we propose a novel framework
CABACE: Character-Aware BERT for ACronym Extraction, which takes into account
character sequences in text and is adapted to scientific and legal domains by
masked language modelling. We further use an objective with an augmented loss
function, adding the max loss and mask loss terms to the standard cross-entropy
loss for training CABACE. We further leverage pseudo labelling and adversarial
data generation to improve the generalizability of the framework. Experimental
results demonstrate the superiority of the proposed framework over
various baselines. Additionally, we show that the proposed framework is better
suited than baseline models for zero-shot generalization to non-English
languages, thus reinforcing the effectiveness of our approach. Our team
BacKGProp secured the highest scores on the French dataset, second-highest on
Danish and Vietnamese, and third-highest in the English-Legal dataset on the
global leaderboard for the acronym extraction (AE) shared task at SDU AAAI-22.
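The abstract states that CABACE takes character sequences into account but does not describe the fusion mechanism. Below is a minimal sketch of one plausible design, a character-level CNN whose pooled output is concatenated with each BERT token vector; the module names, dimensions, and pooling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CharAwareTokenEncoder(nn.Module):
    """Hypothetical sketch: fuse a character-CNN embedding with BERT token
    vectors. All dimensions and design choices are assumptions; the CABACE
    abstract does not specify the fusion architecture."""

    def __init__(self, n_chars=256, char_dim=64, char_hidden=128, bert_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # 1-D convolution over each token's character sequence
        self.conv = nn.Conv1d(char_dim, char_hidden, kernel_size=3, padding=1)
        self.proj = nn.Linear(bert_dim + char_hidden, bert_dim)

    def forward(self, bert_token_vecs, char_ids):
        # bert_token_vecs: (batch, seq_len, bert_dim), output of a BERT encoder
        # char_ids:        (batch, seq_len, max_chars), character ids per token
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.reshape(b * s, c))   # (b*s, c, char_dim)
        chars = self.conv(chars.transpose(1, 2))            # (b*s, hidden, c)
        chars = chars.max(dim=2).values                     # max-pool over chars
        fused = torch.cat([bert_token_vecs,
                           chars.reshape(b, s, -1)], dim=-1)
        return self.proj(fused)                             # back to bert_dim
```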
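The augmented objective is likewise only named: cross-entropy plus max-loss and mask-loss terms. The sketch below fixes one hedged reading, taking the max term as the worst per-token loss in each sequence and the mask term as cross-entropy over masked positions only; both definitions and the weights lam_max and lam_mask are assumptions, not the paper's stated formulation.

```python
import torch.nn.functional as F

def augmented_loss(logits, labels, mask_positions, lam_max=0.5, lam_mask=0.5):
    """Hedged sketch of the augmented objective named in the abstract.
    The max-loss and mask-loss definitions and the weights are assumptions.

    logits:         (batch, seq_len, n_labels)
    labels:         (batch, seq_len) gold label ids
    mask_positions: (batch, seq_len) bool, True where tokens were masked
    """
    per_token = F.cross_entropy(logits.transpose(1, 2), labels,
                                reduction="none")            # (batch, seq_len)
    ce_loss = per_token.mean()
    # Max loss: penalise the single worst-predicted token per sequence.
    max_loss = per_token.max(dim=1).values.mean()
    # Mask loss: cross-entropy averaged over masked positions only.
    masked = per_token[mask_positions]
    mask_loss = masked.mean() if masked.numel() > 0 else per_token.new_zeros(())
    return ce_loss + lam_max * max_loss + lam_mask * mask_loss
```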
Related papers
- Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition [20.926163659469587]
We propose a new memory augmented lookup dictionary based Transformer architecture for LM.
The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital for correctly predicting long-tail tokens.
Our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate.
arXiv Detail & Related papers (2022-12-30T22:26:57Z)
- Enriching Relation Extraction with OpenIE [70.52564277675056]
Relation extraction (RE) is a sub-discipline of information extraction (IE).
In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE.
Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models.
arXiv Detail & Related papers (2022-12-19T11:26:23Z)
- Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers [0.0]
This study covers NLP models ranging from classical (bag-of-words) to state-of-the-art (Transformer-based) approaches.
It aims to provide a comprehensive experimental study of embedding approaches targeting a binary sentiment classification of user reviews in Brazilian Portuguese.
arXiv Detail & Related papers (2022-12-01T15:24:19Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We also evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization [14.414693156937782]
We propose a novel domain-specific generative pre-training (DS-GPT) method for text generation.
We apply it to the product title and review summarization problems on E-commerce mobile displays.
arXiv Detail & Related papers (2021-12-15T19:02:49Z)
- PSG: Prompt-based Sequence Generation for Acronym Extraction [26.896811663334162]
We propose a Prompt-based Sequence Generation (PSG) method for the acronym extraction task.
Specifically, we design a template that prompts the model to generate the acronym texts auto-regressively.
A position extraction algorithm is designed for extracting the position of the generated answers.
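The summary names a position extraction algorithm but does not describe it. One simple, hypothetical way to recover positions is to align each generated answer against the source token sequence, as sketched below; PSG's actual algorithm may differ.

```python
def extract_positions(source_tokens, generated_answers):
    """Hypothetical sketch: locate each generated answer (a token span)
    in the source token sequence and return (start, end) indices.
    The actual PSG algorithm is not described in the summary above."""
    positions = []
    for answer in generated_answers:
        ans_toks = answer.split()
        n = len(ans_toks)
        span = None
        for i in range(len(source_tokens) - n + 1):
            if source_tokens[i:i + n] == ans_toks:
                span = (i, i + n)  # first match wins
                break
        positions.append(span)  # None when the answer is not found verbatim
    return positions

# Example: find "masked language modelling" in a tokenised sentence.
toks = "CABACE is adapted by masked language modelling".split()
print(extract_positions(toks, ["masked language modelling"]))  # [(4, 7)]
```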
arXiv Detail & Related papers (2021-11-29T02:14:38Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
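As a concrete illustration of token merging before topic modelling, the sketch below uses gensim's Phrases to join frequent collocations into single tokens prior to LDA. The thresholds and toy corpus are illustrative assumptions; this is not necessarily the merging scheme the paper evaluates.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.phrases import Phrases, Phraser

# Toy corpus; real experiments would use a full document collection.
docs = [
    "latent dirichlet allocation models topics".split(),
    "latent dirichlet allocation needs good tokens".split(),
    "merged tokens give clearer topic keys".split(),
]

# Merge frequent collocations such as "latent_dirichlet" into one token.
# min_count/threshold are illustrative, not the paper's settings.
bigrams = Phraser(Phrases(docs, min_count=1, threshold=1))
merged_docs = [bigrams[d] for d in docs]

dictionary = Dictionary(merged_docs)
corpus = [dictionary.doc2bow(d) for d in merged_docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)
print(lda.print_topics())
```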
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Leveraging Domain Agnostic and Specific Knowledge for Acronym Disambiguation [5.766754189548904]
Acronym disambiguation aims to find the correct meaning of an ambiguous acronym in a text.
We propose a Hierarchical Dual-path BERT method, coined hdBERT, to capture both general fine-grained and high-level domain-specific representations.
Using the widely adopted SciAD dataset, which contains 62,441 sentences, we investigate the effectiveness of hdBERT.
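The summary names two paths, general and domain-specific, without detailing the fusion. A minimal hypothetical reading is two pretrained encoders whose pooled outputs are concatenated for classification, as sketched below; the encoder checkpoints and the flat concatenation are assumptions, and hdBERT's actual hierarchical design is richer.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualPathClassifier(nn.Module):
    """Hypothetical sketch of a dual-path design: a general encoder and a
    domain-specific encoder feed one classifier head. The checkpoint names
    below are illustrative choices, not hdBERT's reported configuration."""

    def __init__(self, n_classes, general="bert-base-uncased",
                 specific="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.general = AutoModel.from_pretrained(general)
        self.specific = AutoModel.from_pretrained(specific)
        hidden = (self.general.config.hidden_size
                  + self.specific.config.hidden_size)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, gen_inputs, spec_inputs):
        # Take the [CLS] vector from each path and concatenate.
        g = self.general(**gen_inputs).last_hidden_state[:, 0]
        s = self.specific(**spec_inputs).last_hidden_state[:, 0]
        return self.head(torch.cat([g, s], dim=-1))
```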
arXiv Detail & Related papers (2021-07-01T09:10:00Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
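Token-by-token entity name generation is typically kept valid by constraining decoding to a prefix trie of known entity names, the mechanism used in the public GENRE/mGENRE code. The sketch below shows the core trie idea in a simplified, standalone form over plain strings; real systems operate on token ids.

```python
class PrefixTrie:
    """Minimal prefix trie over token sequences. At each decoding step,
    only children of the current prefix are allowed, so generation can
    never leave the set of known entity names. A simplified sketch of
    the constrained-decoding idea used by GENRE-style systems."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []          # not a valid entity-name prefix
        return list(node.keys())

# Toy vocabulary of tokenised entity names.
trie = PrefixTrie([["New", "York"], ["New", "Zealand"], ["Paris"]])
print(trie.allowed_next(["New"]))  # ['York', 'Zealand']
print(trie.allowed_next([]))       # ['New', 'Paris']
```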
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation [74.42107665213909]
Acronyms are the short forms of phrases; they facilitate conveying lengthy content in documents and serve as one of the mainstays of writing.
Due to their importance, identifying acronyms and their corresponding phrases (acronym identification, AI) and finding the correct meaning of each acronym (acronym disambiguation, AD) are crucial for text understanding.
Despite the recent progress on this task, there are some limitations in the existing datasets which hinder further improvement.
arXiv Detail & Related papers (2020-10-28T00:12:36Z)
- Structured Domain Adaptation with Online Relation Regularization for Unsupervised Person Re-ID [62.90727103061876]
Unsupervised domain adaptation (UDA) aims at adapting the model trained on a labeled source-domain dataset to an unlabeled target-domain dataset.
We propose an end-to-end structured domain adaptation framework with an online relation-consistency regularization term.
Our proposed framework is shown to achieve state-of-the-art performance on multiple UDA tasks of person re-ID.
arXiv Detail & Related papers (2020-03-14T14:45:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.