Linguistically Informed Masking for Representation Learning in the
Patent Domain
- URL: http://arxiv.org/abs/2106.05768v1
- Date: Thu, 10 Jun 2021 14:20:57 GMT
- Title: Linguistically Informed Masking for Representation Learning in the
Patent Domain
- Authors: Sophia Althammer, Mark Buckley, Sebastian Hofstätter, Allan Hanbury
- Abstract summary: We propose the empirically motivated Linguistically Informed Masking (LIM) method to focus domain-adaptive pre-training on the linguistic patterns of patents.
We quantify the relevant differences between patent, scientific and general-purpose language.
We demonstrate the impact of balancing the learning from different information sources during domain adaptation for the patent domain.
- Score: 7.911344873839031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Domain-specific contextualized language models have demonstrated substantial
effectiveness gains for domain-specific downstream tasks, like similarity
matching, entity recognition, or information retrieval. However, successfully
applying such models in highly specific language domains requires domain
adaptation of the pre-trained models. In this paper we propose the empirically
motivated Linguistically Informed Masking (LIM) method to focus
domain-adaptive pre-training on the linguistic patterns of patents, which use
a highly technical sublanguage. We quantify the relevant differences between
patent, scientific and general-purpose language and demonstrate for two
different language models (BERT and SciBERT) that domain adaptation with LIM
leads to systematically improved representations by evaluating the performance
of the domain-adapted representations of patent language on two independent
downstream tasks, the IPC classification and similarity matching. We
demonstrate the impact of balancing the learning from different information
sources during domain adaptation for the patent domain. We make the source code
as well as the domain-adaptive pre-trained patent language models publicly
available at https://github.com/sophiaalthammer/patent-lim.
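To make the idea concrete, here is a minimal sketch of what linguistically informed masking could look like, assuming (as the abstract's mention of linguistic patterns and of balancing information sources suggests) that masking is biased towards noun chunks. The spaCy-based chunking, the function name lim_mask, and the probability values are illustrative assumptions, not the authors' released implementation; the linked repository contains the actual code.

```python
import random

import spacy  # assumes the small English model is installed: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
MASK = "[MASK]"


def lim_mask(text, noun_chunk_prob=0.3, other_prob=0.1, seed=None):
    """Illustrative linguistically informed masking: tokens inside noun
    chunks are masked with a higher probability than the remaining tokens,
    shifting the masked-language-modelling signal towards the noun-phrase-heavy
    sublanguage of patent text. The probabilities are placeholders, not the
    values used in the paper."""
    rng = random.Random(seed)
    doc = nlp(text)

    # Collect the token indices covered by noun chunks.
    in_chunk = set()
    for chunk in doc.noun_chunks:
        in_chunk.update(range(chunk.start, chunk.end))

    masked = []
    for i, token in enumerate(doc):
        p = noun_chunk_prob if i in in_chunk else other_prob
        masked.append(MASK if rng.random() < p else token.text)
    return " ".join(masked)


if __name__ == "__main__":
    claim = ("A semiconductor device comprising a substrate and a "
             "dielectric layer formed on the substrate.")
    print(lim_mask(claim, seed=0))
```

In an actual pre-training pipeline the same bias would be applied to subword token IDs in the data collator rather than to surface strings.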
Related papers
- DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models [8.328673243329794]
This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea.
Existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristics.
We propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning.
arXiv Detail & Related papers (2024-09-23T10:59:02Z)
- Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
- Boosting Large Language Models with Continual Learning for Aspect-based Sentiment Analysis [33.86086075084374]
Aspect-based sentiment analysis (ABSA) is an important subtask of sentiment analysis.
We propose a Large Language Model-based Continual Learning (LLM-CL) model for ABSA.
arXiv Detail & Related papers (2024-05-09T02:00:07Z)
- Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z)
- Adapt in Contexts: Retrieval-Augmented Domain Adaptation via In-Context Learning [48.22913073217633]
Large language models (LLMs) have showcased their few-shot inference capability, known as in-context learning.
In this paper, we study the unsupervised domain adaptation (UDA) problem under an in-context learning setting to adapt language models from the source domain to the target domain without any target labels.
We devise different prompting and training strategies, accounting for different LM architectures to learn the target distribution via language modeling.
arXiv Detail & Related papers (2023-11-20T06:06:20Z)
- Domain Private Transformers for Multi-Domain Dialog Systems [2.7013801448234367]
This paper proposes domain privacy as a novel way to quantify how likely a conditional language model is to leak across domains.
Experiments on membership inference attacks show that our proposed method has comparable resiliency to methods adapted from recent literature on differentially private language models.
arXiv Detail & Related papers (2023-05-23T16:27:12Z)
- SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains [14.096170976149521]
SwitchPrompt is a novel and lightweight prompting methodology for adapting language models trained on general-domain datasets to diverse low-resource domains.
Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of general-domain pre-trained language models when used with SwitchPrompt.
They often even outperform their domain-specific counterparts trained with baseline state-of-the-art prompting methods, with accuracy improvements of up to 10.7%.
arXiv Detail & Related papers (2023-02-14T07:14:08Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT to learn from both in-domain corpus and relevant domain corpora.
Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
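To illustrate the last entry, the sketch below shows one clustering-based data selection step in that spirit: embed sentences with a pre-trained encoder, fit an unsupervised mixture model, and keep the candidates that fall into the clusters occupied by a small in-domain seed set. The sentence-transformers encoder, the GaussianMixture configuration, and the selection heuristic are stand-ins chosen for brevity, not the cited paper's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture


def select_in_domain(candidates, domain_seed, n_clusters=5):
    """Embed sentences with a pre-trained encoder, cluster them without
    supervision, and keep the candidates that land in the clusters
    occupied by a small in-domain seed set."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder
    cand_emb = encoder.encode(candidates)
    seed_emb = encoder.encode(domain_seed)

    # Fit an unsupervised mixture model over all sentence embeddings.
    gmm = GaussianMixture(n_components=n_clusters, random_state=0)
    gmm.fit(np.vstack([cand_emb, seed_emb]))

    seed_clusters = set(gmm.predict(seed_emb))  # clusters the seed falls into
    cand_clusters = gmm.predict(cand_emb)
    return [s for s, c in zip(candidates, cand_clusters) if c in seed_clusters]
```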
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.