LegalRelectra: Mixed-domain Language Modeling for Long-range Legal Text Comprehension
- URL: http://arxiv.org/abs/2212.08204v1
- Date: Fri, 16 Dec 2022 00:15:14 GMT
- Title: LegalRelectra: Mixed-domain Language Modeling for Long-range Legal Text Comprehension
- Authors: Wenyue Hua, Yuchen Zhang, Zhe Chen, Josie Li, and Melanie Weber
- Abstract summary: LegalRelectra is a legal-domain language model trained on mixed-domain legal and medical corpora.
Our training architecture implements the Electra framework, but utilizes Reformer instead of BERT for its generator and discriminator.
- Score: 6.442209435258797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The application of Natural Language Processing (NLP) to specialized domains,
such as the law, has recently received a surge of interest. As many legal
services rely on processing and analyzing large collections of documents,
automating such tasks with NLP tools emerges as a key challenge. Many popular
language models, such as BERT or RoBERTa, are general-purpose models, which
have limitations on processing specialized legal terminology and syntax. In
addition, legal documents may contain specialized vocabulary from other
domains, such as medical terminology in personal injury text. Here, we propose
LegalRelectra, a legal-domain language model that is trained on mixed-domain
legal and medical corpora. We show that our model improves over general-domain
and single-domain medical and legal language models when processing
mixed-domain (personal injury) text. Our training architecture implements the
Electra framework, but utilizes Reformer instead of BERT for its generator and
discriminator. We show that this improves the model's performance on processing
long passages and results in better long-range text comprehension.
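To make the training setup concrete, below is a minimal sketch of the ELECTRA-style replaced-token-detection (RTD) objective described in the abstract. It is an illustrative reconstruction, not the authors' released code: small PyTorch Transformer encoders stand in for the Reformer generator and discriminator, and the vocabulary size, hidden width, sequence length, and loss weight are placeholder values.

```python
# Illustrative ELECTRA-style replaced-token detection (RTD) sketch.
# The paper pairs this objective with Reformer encoders; here small
# nn.TransformerEncoder stacks stand in for the generator/discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MAX_LEN, MASK_ID = 30522, 256, 4096, 103  # placeholder sizes

class TinyEncoder(nn.Module):
    """Token + position embeddings, a small Transformer stack, a per-token head."""
    def __init__(self, out_dim, layers=2):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, HIDDEN)
        self.pos = nn.Embedding(MAX_LEN, HIDDEN)
        block = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(HIDDEN, out_dim)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        hidden = self.encoder(self.tok(ids) + self.pos(positions))
        return self.head(hidden)                      # (batch, seq, out_dim)

generator = TinyEncoder(out_dim=VOCAB, layers=2)      # small masked-LM generator
discriminator = TinyEncoder(out_dim=1, layers=4)      # original-vs-replaced classifier

def rtd_loss(ids, mask_prob=0.15, disc_weight=50.0):
    # 1) Mask a random subset of tokens; the generator predicts the originals.
    masked = torch.rand(ids.shape, device=ids.device) < mask_prob
    gen_logits = generator(ids.masked_fill(masked, MASK_ID))
    mlm_loss = F.cross_entropy(gen_logits[masked], ids[masked])

    # 2) Sample replacements from the generator and ask the discriminator
    #    to label every token as original (0) or replaced (1).
    with torch.no_grad():
        probs = F.softmax(gen_logits, dim=-1)
        sampled = torch.multinomial(probs.reshape(-1, VOCAB), 1).view(ids.shape)
    corrupted = torch.where(masked, sampled, ids)
    replaced = (corrupted != ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced)

    return mlm_loss + disc_weight * disc_loss

loss = rtd_loss(torch.randint(0, VOCAB, (2, 512)))    # toy batch of 2 sequences
loss.backward()
```

Because the sampled tokens are integers drawn under no_grad, the discriminator loss does not backpropagate into the generator; the two networks are trained jointly but coupled only through the corrupted inputs. In the paper's architecture both encoders would instead use Reformer's chunked LSH attention, which is what allows far longer input passages than a standard BERT encoder.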
Related papers
- Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model [1.3812010983144798]
This paper shows that we can leverage a large number of annotation-free legal documents without Chinese word segmentation to fine-tune a large-scale language model.
The fine-tuned model can also draft legal documents while helping to protect information privacy and improve information security.
arXiv Detail & Related papers (2024-06-06T16:00:20Z)
- Improving Legal Judgement Prediction in Romanian with Long Text Encoders [0.8933959485129375]
We investigate specialized and general models for predicting the final ruling of a legal case, known as Legal Judgment Prediction (LJP).
In this work we focus on methods to extend the sequence length of Transformer-based models to better understand the long documents present in legal corpora (a generic sliding-window sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-02-29T13:52:33Z)
- One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support [18.810320088441678]
This work introduces a novel NLP benchmark for the legal domain.
It challenges LLMs in five key dimensions: processing long documents (up to 50K tokens), using domain-specific knowledge (embodied in legal texts), and multilingual understanding (covering five languages).
Our benchmark contains diverse datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual legal system.
arXiv Detail & Related papers (2023-06-15T16:19:15Z)
- SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval [75.05173891207214]
Legal case retrieval plays a core role in the intelligent legal system.
Most existing language models have difficulty understanding the long-distance dependencies between different structures.
We propose a new Structure-Aware pre-traIned language model for LEgal case Retrieval.
arXiv Detail & Related papers (2023-04-22T10:47:01Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Toward Improving Attentive Neural Networks in Legal Text Processing [0.20305676256390934]
In this dissertation, we present the main achievements in improving attentive neural networks in automatic legal document processing.
Language models tend to grow larger and larger; without expert knowledge, however, these models can still fail at domain adaptation.
arXiv Detail & Related papers (2022-03-15T20:45:22Z)
- JuriBERT: A Masked-Language Model Adaptation for French Legal Text [14.330469316695853]
We focus on creating a language model adapted to French legal text with the goal of helping law professionals.
We conclude that some specific tasks do not benefit from generic language models pre-trained on large amounts of data.
We release JuriBERT, a new set of BERT models adapted to the French legal domain.
arXiv Detail & Related papers (2021-10-04T14:51:24Z)
- Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release a Longformer-based pre-trained language model, named Lawformer, for Chinese legal long-document understanding.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
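Several of the papers above (the Romanian LJP work, Lawformer, and LegalRelectra itself) are motivated by the fact that legal documents routinely exceed the 512-token limit of standard BERT-style encoders. For reference, a common generic baseline is to slide an overlapping window over the document, encode each chunk with an off-the-shelf encoder, and pool the chunk vectors. The sketch below shows only that generic baseline under assumed defaults (bert-base-uncased as a placeholder encoder, mean pooling of per-chunk [CLS] vectors); it is not the method of any specific paper listed here.

```python
# Generic sliding-window baseline for documents longer than a 512-token encoder:
# split the token sequence into overlapping chunks, encode each chunk, and
# mean-pool the per-chunk [CLS] vectors into one document representation.
# "bert-base-uncased" is a placeholder; any BERT-style checkpoint would do.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

def encode_long_document(text, max_len=512, stride=128):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    body = max_len - 2                                    # room for [CLS] and [SEP]
    chunk_vecs, start = [], 0
    while True:
        window = ids[start:start + body]
        input_ids = torch.tensor([[cls_id] + window + [sep_id]])
        with torch.no_grad():
            out = encoder(input_ids=input_ids)
        chunk_vecs.append(out.last_hidden_state[:, 0])    # [CLS] vector of this chunk
        if start + body >= len(ids):
            break
        start += body - stride                            # overlap consecutive chunks
    return torch.cat(chunk_vecs, dim=0).mean(dim=0)       # (hidden_size,)

doc_vector = encode_long_document("The plaintiff alleges negligence ... " * 500)
```

Mean pooling is the simplest aggregation choice; the long-text papers above instead change the encoder itself (Longformer-style sparse attention in Lawformer, Reformer's LSH attention in LegalRelectra) so that a single forward pass can cover the whole document.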
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.