Efficient Domain Adaptation of Language Models via Adaptive Tokenization
- URL: http://arxiv.org/abs/2109.07460v1
- Date: Wed, 15 Sep 2021 17:51:27 GMT
- Title: Efficient Domain Adaptation of Language Models via Adaptive Tokenization
- Authors: Vin Sachidananda and Jason S. Kessler and Yi-an Lai
- Abstract summary: We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora.
Our approach yields smaller models and lower training and inference time than other tokenizer-augmentation approaches.
- Score: 5.058301279065432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual embedding-based language models trained on large data sets, such
as BERT and RoBERTa, provide strong performance across a wide range of tasks
and are ubiquitous in modern NLP. It has been observed that fine-tuning these
models on tasks involving data from domains different from that on which they
were pretrained can lead to suboptimal performance. Recent work has explored
approaches to adapt pretrained language models to new domains by incorporating
additional pretraining using domain-specific corpora and task data. We propose
an alternative approach for transferring pretrained language models to new
domains by adapting their tokenizers. We show that domain-specific subword
sequences can be efficiently determined directly from divergences in the
conditional token distributions of the base and domain-specific corpora. In
datasets from four disparate domains, we find adaptive tokenization on a
pretrained RoBERTa model provides >97% of the performance benefits of
domain-specific pretraining. Our approach yields smaller models and lower
training and inference time than other approaches using tokenizer augmentation. While
adaptive tokenization incurs a 6% increase in model parameters in our
experimentation, due to the introduction of 10k new domain-specific tokens, our
approach, using 64 vCPUs, is 72x faster than further pretraining the language
model on domain-specific corpora on 8 TPUs.
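As a rough illustration of the approach described in the abstract, the sketch below scores candidate subword sequences by how much more probable they are under a domain corpus than under the base corpus, adds the top-scoring sequences to a RoBERTa tokenizer, and initializes their embeddings from existing subword embeddings. The n-gram candidate generation, the KL-style frequency score, and the mean-of-subwords initialization are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: pick domain-specific subword sequences by comparing their relative
# frequencies in a domain corpus vs. the base corpus, then add them to the
# tokenizer and initialize their embeddings from existing subwords.
from collections import Counter
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def sequence_counts(corpus, tokenizer, max_len=4):
    """Count contiguous subword n-grams (as surface strings) in a corpus."""
    counts = Counter()
    for text in corpus:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        toks = tokenizer.convert_ids_to_tokens(ids)
        for n in range(2, max_len + 1):
            for i in range(len(toks) - n + 1):
                counts[tokenizer.convert_tokens_to_string(toks[i:i + n]).strip()] += 1
    return counts


def select_domain_tokens(base_corpus, domain_corpus, tokenizer, k=10_000):
    """Rank candidates by their contribution to KL(domain || base); keep the top k."""
    base = sequence_counts(base_corpus, tokenizer)
    dom = sequence_counts(domain_corpus, tokenizer)
    base_total, dom_total = sum(base.values()) + 1, sum(dom.values())

    def score(seq):
        p_dom = dom[seq] / dom_total
        p_base = (base[seq] + 1) / base_total  # add-one smoothing for unseen sequences
        return p_dom * math.log(p_dom / p_base)

    return sorted(dom, key=score, reverse=True)[:k]


tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

base_corpus = ["replace with general-domain text"]      # placeholder corpora
domain_corpus = ["replace with domain-specific text"]

new_tokens = select_domain_tokens(base_corpus, domain_corpus, tokenizer)
# Record each candidate's decomposition under the *original* vocabulary first.
piece_ids = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}
old_embeddings = model.get_input_embeddings().weight.detach().clone()

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new token's embedding as the mean of the subwords it replaces.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for t in new_tokens:
        if piece_ids[t]:
            emb[tokenizer.convert_tokens_to_ids(t)] = old_embeddings[piece_ids[t]].mean(dim=0)
```

With roughly 10k added tokens, as in the paper's experiments, the new embedding rows account for the 6% parameter increase mentioned in the abstract; how the selection score and the embedding initialization are defined are separate design choices.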
Related papers
- QAGAN: Adversarial Approach To Learning Domain Invariant Language
Features [0.76146285961466]
We explore an adversarial training approach to learning domain-invariant features.
We achieve a 15.2% improvement in EM score and a 5.6% boost in F1 score on an out-of-domain validation dataset.
arXiv Detail & Related papers (2022-06-24T17:42:18Z) - CLIN-X: pre-trained language models and a study on cross-task transfer
for concept extraction in the clinical domain [22.846469609263416]
We introduce the pre-trained CLIN-X (Clinical XLM-R) language models and show how CLIN-X outperforms other pre-trained transformer models.
Our studies reveal stable model performance despite a lack of annotated data, with improvements of up to 47 F1 points when only 250 labeled sentences are available.
Our results highlight the importance of specialized language models such as CLIN-X for concept extraction in non-standard domains.
arXiv Detail & Related papers (2021-12-16T10:07:39Z) - Non-Parametric Unsupervised Domain Adaptation for Neural Machine
Translation [61.27321597981737]
$k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval (a minimal sketch of this retrieval step follows the related-papers list below).
arXiv Detail & Related papers (2021-09-14T11:50:01Z) - Adapt-and-Distill: Developing Small, Fast and Effective Pretrained
Language Models for Domains [45.07506437436464]
We present a general approach to developing small, fast and effective pre-trained models for specific domains.
This is achieved by adapting the off-the-shelf general pre-trained models and performing task-agnostic knowledge distillation in target domains.
arXiv Detail & Related papers (2021-06-25T07:37:05Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Feature Adaptation of Pre-Trained Language Models across Languages and
Domains with Robust Self-Training [47.12438995938133]
We adapt pre-trained language models (PrLMs) to new domains without fine-tuning.
We present class-aware feature self-distillation (CFd) to learn discriminative features from PrLMs.
Experiments on two monolingual and multilingual Amazon review datasets show that CFd can consistently improve the performance of self-training.
arXiv Detail & Related papers (2020-09-24T08:04:37Z) - Don't Stop Pretraining: Adapt Language Models to Domains and Tasks [81.99843216550306]
We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks.
A second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains.
Adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
arXiv Detail & Related papers (2020-04-23T04:21:19Z) - Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
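As a loose illustration of the token-level $k$-nearest-neighbor retrieval described in the $k$NN-MT entry above, the numpy-only sketch below builds a datastore of (decoder hidden state, next target token) pairs and mixes the retrieved distribution with the base model's next-token distribution. The datastore contents, distance temperature, and interpolation weight `lam` are illustrative assumptions, not that paper's implementation.

```python
# Sketch of token-level kNN retrieval over a datastore built from in-domain data.
import numpy as np


def build_datastore(hidden_states, next_tokens):
    """Datastore maps decoder hidden states (keys) to the gold next token (values)."""
    return np.asarray(hidden_states, dtype=np.float32), np.asarray(next_tokens)


def knn_distribution(query, keys, values, vocab_size, k=8, temperature=10.0):
    """Retrieve the k nearest keys and turn their tokens into a vocab distribution."""
    dists = np.linalg.norm(keys - query, axis=1)   # L2 distance to every key
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)     # scatter-add weights onto tokens
    return p_knn


def interpolate(p_model, p_knn, lam=0.5):
    """Final next-token distribution mixes the NMT model and the kNN estimate."""
    return (1 - lam) * p_model + lam * p_knn


# Example: tiny random datastore and a uniform dummy model over a 100-token vocab.
rng = np.random.default_rng(0)
keys, values = build_datastore(rng.normal(size=(500, 16)), rng.integers(0, 100, size=500))
p_model = np.full(100, 1 / 100)
p_next = interpolate(p_model, knn_distribution(rng.normal(size=16), keys, values, vocab_size=100))
```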