Related papers: Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

URL: http://arxiv.org/abs/2402.12036v2
Date: Mon, 26 Feb 2024 16:47:36 GMT
Title: Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics
Authors: Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux
Abstract summary: We introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language.
Score: 4.9639158834745745
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.
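The ranking-guided masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring function below (a smoothed log-ratio of domain frequency to general-corpus frequency) is a hypothetical stand-in for the paper's genre and topicality scores, and the names `specificity_scores` and `selective_mask`, the toy corpora, and the 15% default masking ratio (the rate commonly used for BERT-style masking) are all assumptions made for illustration.

```python
import math
from collections import Counter

MASK = "[MASK]"


def specificity_scores(domain_docs, general_docs):
    """Score each domain word by how much more frequent it is in the
    domain corpus than in a general corpus (add-one smoothed log-ratio).
    Hypothetical stand-in for the paper's genre/topicality scores."""
    dom = Counter(w for d in domain_docs for w in d.lower().split())
    gen = Counter(w for d in general_docs for w in d.lower().split())
    dom_total = sum(dom.values())
    gen_total = sum(gen.values())
    vocab = len(set(dom) | set(gen))
    scores = {}
    for word, count in dom.items():
        p_dom = (count + 1) / (dom_total + vocab)
        p_gen = (gen[word] + 1) / (gen_total + vocab)
        scores[word] = math.log(p_dom / p_gen)
    return scores


def selective_mask(tokens, scores, ratio=0.15):
    """Mask the highest-scoring tokens instead of choosing them at random.
    At least one token is always masked; ties keep the earlier position."""
    n_mask = max(1, round(len(tokens) * ratio))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (-scores.get(tokens[i].lower(), float("-inf")), i),
    )
    chosen = set(ranked[:n_mask])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


# Toy corpora (illustrative only).
domain_docs = ["the court dismissed the appeal", "the plaintiff filed an appeal"]
general_docs = ["the cat sat on the mat", "the dog chased the cat"]

scores = specificity_scores(domain_docs, general_docs)
masked = selective_mask("the court dismissed the appeal".split(), scores, ratio=0.2)
print(masked)
```

In this toy setup the word "appeal" is ranked highest because it is frequent in the domain documents and absent from the general ones, so it is masked in preference to a function word such as "the". A real pipeline would operate on subword tokens and would likely sample from the ranking rather than always taking the top entries.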

Related papers

On The Landscape of Spoken Language Models: A Comprehensive Survey [144.11278973534203]
Spoken language models (SLMs) act as universal speech processing systems. Work in this area is very diverse, with a range of terminology and evaluation settings.
arXiv Detail & Related papers (2025-04-11T13:40:53Z)
DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models [8.328673243329794]
This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics. We propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning.
arXiv Detail & Related papers (2024-09-23T10:59:02Z)
Investigating Masking-based Data Generation in Language Models [0.0]
A feature of BERT and models with similar architecture is the objective of masked language modeling. Data augmentation is a data-driven technique widely used in machine learning. Recent studies have utilized masked language models to generate artificially augmented data for NLP downstream tasks.
arXiv Detail & Related papers (2023-06-16T16:48:27Z)
Self-Evolution Learning for Discriminative Language Model Pretraining [103.57103957631067]
Self-Evolution learning (SE) is a simple and effective token masking and learning method. SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach.
arXiv Detail & Related papers (2023-05-24T16:00:54Z)
Unsupervised Improvement of Factual Knowledge in Language Models [4.5788796239850225]
Masked language modeling plays a key role in pretraining large language models. We propose an approach for influencing pretraining in a way that can improve language model performance on a variety of knowledge-intensive tasks.
arXiv Detail & Related papers (2023-04-04T07:37:06Z)
Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements. We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation. We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation [63.195935452646815]
We propose a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training. We present a novel reinforcement learning-based framework which learns the masking policy. We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets.
arXiv Detail & Related papers (2020-10-06T13:27:01Z)
Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size. We propose a fully compositional output embedding layer for language models. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks. Our experiments show that the unified language models pre-trained using the pseudo-masked language model (PMLM) procedure achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.