HistBERT: A Pre-trained Language Model for Diachronic Lexical Semantic
Analysis
- URL: http://arxiv.org/abs/2202.03612v1
- Date: Tue, 8 Feb 2022 02:53:48 GMT
- Authors: Wenjun Qiu, Yang Xu
- Abstract summary: We present a pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus of Historical American English.
We report promising results in word similarity and semantic shift analysis.
- Score: 3.2851864672627618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contextualized word embeddings have demonstrated state-of-the-art performance
in various natural language processing tasks including those that concern
historical semantic change. However, language models such as BERT were trained
primarily on contemporary corpus data. To investigate whether training on
historical corpus data improves diachronic semantic analysis, we present a
pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus
of Historical American English. We examine the effectiveness of our approach by
comparing the performance of the original BERT and that of HistBERT, and we
report promising results in word similarity and semantic shift analysis. Our
work suggests that the effectiveness of contextual embeddings in diachronic
semantic analysis depends on the temporal profile of the input text, and that
care should be taken in applying this methodology to study historical semantic
change.
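The abstract does not say how semantic shift is scored. As a minimal sketch, assuming each usage of a word has already been encoded as a vector by some contextual model (BERT or HistBERT), one common measure is the average pairwise cosine distance between usages from two time periods. The function names and toy vectors below are hypothetical and are not taken from the paper.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(early_usages, late_usages):
    """Mean cosine distance over every (early, late) pair of usage vectors."""
    pairs = list(product(early_usages, late_usages))
    if not pairs:
        raise ValueError("both periods need at least one usage vector")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d stand-ins for contextual embeddings of one word's usages.
# Real embeddings would come from a model and have hundreds of dimensions.
early = [[1.0, 0.0], [0.9, 0.1]]
late_stable = [[1.0, 0.1], [0.8, 0.0]]   # usages similar to the early period
late_shifted = [[0.0, 1.0], [0.1, 0.9]]  # usages pointing in a new direction

stable_score = average_pairwise_distance(early, late_stable)
shifted_score = average_pairwise_distance(early, late_shifted)
print(f"stable word:  {stable_score:.3f}")
print(f"shifted word: {shifted_score:.3f}")
```

A higher score suggests that the word's usages in the two periods occupy different regions of the embedding space. Comparing the scores produced by two different encoders on the same usages is one way to probe the abstract's point that results depend on the temporal profile of the training text.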
Related papers
- Comparative Analysis of Static and Contextual Embeddings for Analyzing Semantic Changes in Medieval Latin Charters [6.883666189245419]
This paper presents the first computational analysis of semantic change pre- and post-Norman Conquest.
It is the first systematic comparison of static and contextual embeddings in a scarce historical data set.
Our findings confirm that, consistent with existing studies, contextual embeddings outperform static word embeddings in capturing semantic change.
arXiv Detail & Related papers (2024-10-11T22:19:17Z)
- Distilling Monolingual and Crosslingual Word-in-Context Representations [18.87665111304974]
We propose a method that distils representations of word meaning in context from a pre-trained language model in both monolingual and crosslingual settings.
Our method does not require human-annotated corpora nor updates of the parameters of the pre-trained model.
Our method learns to combine the outputs of different hidden layers of the pre-trained model using self-attention.
arXiv Detail & Related papers (2024-09-13T11:10:16Z)
- MACT: Model-Agnostic Cross-Lingual Training for Discourse Representation Structure Parsing [4.536003573070846]
We introduce a cross-lingual training strategy for semantic representation parsing models.
It exploits the alignments between languages encoded in pre-trained language models.
Experiments show significant improvements in DRS clause and graph parsing in English, German, Italian and Dutch.
arXiv Detail & Related papers (2024-06-03T07:02:57Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary [1.105375732595832]
NER in historical text faces challenges such as scarce annotated corpora, multiple languages, various kinds of noise, and conventions that differ greatly from those captured by contemporary language models.
This paper introduces a Korean historical corpus (the Diary of the Royal Secretary, named SeungJeongWon) that was recorded over several centuries and recently augmented with named entity information and phrase markers carefully annotated by historians.
arXiv Detail & Related papers (2023-06-26T11:00:35Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilized unlabeled data to improve the performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary (typically selected before training and permanently fixed later) affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)