Can BERT Dig It? -- Named Entity Recognition for Information Retrieval
in the Archaeology Domain
- URL: http://arxiv.org/abs/2106.07742v1
- Date: Mon, 14 Jun 2021 20:26:19 GMT
- Title: Can BERT Dig It? -- Named Entity Recognition for Information Retrieval
in the Archaeology Domain
- Authors: Alex Brandsen, Suzan Verberne, Karsten Lambers, Milco Wansleeben
- Abstract summary: ArcheoBERTje is a BERT model pre-trained on Dutch archaeological texts.
We analyse the differences between the vocabulary and output of the BERT models on the full collection.
- Score: 3.928604516640069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The amount of archaeological literature is growing rapidly. Until recently,
these data were only accessible through metadata search. We implemented a text
retrieval engine for a large archaeological text collection ($\sim 658$ Million
words). In archaeological IR, domain-specific entities such as locations, time
periods, and artefacts, play a central role. This motivated the development of
a named entity recognition (NER) model to annotate the full collection with
archaeological named entities. In this paper, we present ArcheoBERTje, a BERT
model pre-trained on Dutch archaeological texts. We compare the model's quality
and output on a Named Entity Recognition task to a generic multilingual model
and a generic Dutch model. We also investigate ensemble methods for combining
multiple BERT models, and combining the best BERT model with a domain thesaurus
using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms
both the multilingual and Dutch model significantly with a smaller standard
deviation between runs, reaching an average F1 score of 0.735. The model also
outperforms ensemble methods combining the three models. Combining ArcheoBERTje
predictions and explicit domain knowledge from the thesaurus did not increase
the F1 score. We quantitatively and qualitatively analyse the differences
between the vocabulary and output of the BERT models on the full collection and
provide valuable insights into the effect of fine-tuning for specific
domains. Our results indicate that for a highly specific text domain such as
archaeology, further pre-training on domain-specific data increases the model's
quality on NER by a much larger margin than shown for other domains in the
literature, and that domain-specific pre-training makes the addition of domain
knowledge from a thesaurus unnecessary.
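To make the NER fine-tuning step concrete, the sketch below shows how a Dutch BERT checkpoint could be fine-tuned for token classification with the Hugging Face transformers library. This is an illustrative sketch, not the authors' code: the checkpoint name GroNLP/bert-base-dutch-cased (the generic Dutch BERTje model the paper compares against), the BIO tag set, and the toy sentence are assumptions; a domain-adapted checkpoint such as ArcheoBERTje and the paper's actual archaeological tag set would slot in the same way.
```python
# Illustrative sketch only (checkpoint name, tag set, and toy data are
# assumptions); not the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO tags for the entity types mentioned in the abstract (artefacts, time
# periods, locations); the exact tag names are assumptions.
labels = ["O", "B-ART", "I-ART", "B-TIME", "I-TIME", "B-LOC", "I-LOC"]
label2id = {tag: i for i, tag in enumerate(labels)}
id2label = {i: tag for tag, i in label2id.items()}

# Generic Dutch BERT (BERTje); a domain-adapted checkpoint would be loaded
# the same way.
checkpoint = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

# One toy training example: "Flint axe found near Leiden."
words = ["Vuurstenen", "bijl", "gevonden", "bij", "Leiden", "."]
word_tags = ["B-ART", "I-ART", "O", "O", "B-LOC", "O"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to sub-word tokens: special tokens get -100 (ignored
# by the loss); every sub-token inherits its word's tag (a simple strategy).
aligned = [
    -100 if wid is None else label2id[word_tags[wid]]
    for wid in enc.word_ids(batch_index=0)
]
tag_ids = torch.tensor([aligned])

# A single optimisation step; a real run loops over the annotated corpus.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
out = model(**enc, labels=tag_ids)
out.loss.backward()
optimizer.step()
print(f"training loss: {out.loss.item():.3f}")
```
In the paper, the best-performing setup is the domain-adapted model fine-tuned in this fashion; combining its predictions with thesaurus features via a CRF did not improve the F1 score.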
Related papers
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study the multi-source Domain Generalization of text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z) - A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models [1.3927943269211591]
This paper experiments with leveraging in-context learning capabilities of Large Language Models to perform data annotation.
We show that by using a few-shot learning strategy with structured prompts and only minimal expert annotation, the presented approach can potentially support domain adaptation of a science KG generation model.
arXiv Detail & Related papers (2024-08-05T11:06:36Z) - A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation [52.0964459842176]
Current state-of-the-art dialogue systems heavily rely on extensive training datasets.
We propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD$^2$G.
The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training.
arXiv Detail & Related papers (2024-06-14T09:52:27Z) - NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from
Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z) - Adapting Knowledge for Few-shot Table-to-Text Generation [35.59842534346997]
We propose a novel framework: Adapt-Knowledge-to-Generate (AKG).
AKG incorporates unlabeled domain-specific knowledge into the model, which brings at least three benefits.
Our model achieves superior performance in terms of both fluency and accuracy as judged by human and automatic evaluations.
arXiv Detail & Related papers (2023-02-24T05:48:53Z) - Automatic Creation of Named Entity Recognition Datasets by Querying
Phrase Representations [20.00016240535205]
Most weakly supervised named entity recognition models rely on domain-specific dictionaries provided by experts.
We present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries.
We demonstrate that HighGEN outperforms the previous best model by an average of 4.7 F1 points across five NER benchmark datasets.
arXiv Detail & Related papers (2022-10-14T07:36:44Z) - Improving Retrieval Augmented Neural Machine Translation by Controlling
Source and Fuzzy-Match Interactions [15.845071122977158]
We build on the idea of Retrieval Augmented Translation (RAT) where top-k in-domain fuzzy matches are found for the source sentence.
We propose a novel architecture to control interactions between a source sentence and the top-k fuzzy target-language matches.
arXiv Detail & Related papers (2022-10-10T23:33:15Z) - HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain
Language Model Compression [53.90578309960526]
Large pre-trained language models (PLMs) have shown overwhelming performance compared with traditional neural network methods.
We propose a hierarchical relational knowledge distillation (HRKD) method to capture both hierarchical and domain relational information.
arXiv Detail & Related papers (2021-10-16T11:23:02Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on large unlabeled corpora or labeled NER training data in target domains.
We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.