JuriBERT: A Masked-Language Model Adaptation for French Legal Text
- URL: http://arxiv.org/abs/2110.01485v1
- Date: Mon, 4 Oct 2021 14:51:24 GMT
- Title: JuriBERT: A Masked-Language Model Adaptation for French Legal Text
- Authors: Stella Douka, Hadi Abdine, Michalis Vazirgiannis, Rajaa El Hamdani,
David Restrepo Amariles
- Abstract summary: We focus on creating a language model adapted to French legal text with the goal of helping law professionals.
We conclude that some specific tasks do not benefit from generic language models pre-trained on large amounts of data.
We release JuriBERT, a new set of BERT models adapted to the French legal domain.
- Score: 14.330469316695853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models have proven to be very useful when adapted to specific
domains. Nonetheless, little research has been done on the adaptation of
domain-specific BERT models in the French language. In this paper, we focus on
creating a language model adapted to French legal text with the goal of helping
law professionals. We conclude that some specific tasks do not benefit from
generic language models pre-trained on large amounts of data. We explore the
use of smaller architectures in domain-specific sub-languages and their
benefits for French legal text. We prove that domain-specific pre-trained
models can perform better than their equivalent generalised ones in the legal
domain. Finally, we release JuriBERT, a new set of BERT models adapted to the
French legal domain.
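For readers who want a concrete picture of this kind of domain adaptation, below is a minimal sketch of masked-language-model pretraining on French legal text with the Hugging Face Transformers library. It is not the authors' training code: the starting checkpoint (`camembert-base`), the corpus file name, and all hyperparameters are illustrative assumptions, and it continues pretraining from an existing French checkpoint purely for brevity.

```python
# Minimal sketch: masked-language-model pretraining on French legal text.
# Checkpoint name, file path, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

checkpoint = "camembert-base"                       # assumed French base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# One legal document (or paragraph) per line in a plain-text file (assumed).
dataset = load_dataset("text", data_files={"train": "french_legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly, as in BERT.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="mlm-legal-fr",
                         per_device_train_batch_size=16,
                         num_train_epochs=1,
                         save_steps=10_000)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The same training loop applies whether the weights are initialised from an existing checkpoint, as here, or from a freshly configured smaller architecture trained from scratch, which is closer to the trade-off the abstract describes.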
Related papers
- TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text [5.523385345486362]
We have developed language models specifically designed for legal applications.
Our innovative approach significantly improves capabilities in legal tasks by using Large Language Models (LLMs) to convert raw training data into reading comprehension text.
arXiv Detail & Related papers (2024-10-28T19:32:18Z)
- Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts (a rough illustration of this prompting setup appears after this list).
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
- Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration [52.57055162778548]
Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI.
Precedents are previous legal cases with similar facts, which serve as the basis for judging subsequent cases in national legal systems.
Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task.
arXiv Detail & Related papers (2023-10-13T16:47:20Z)
- LegalRelectra: Mixed-domain Language Modeling for Long-range Legal Text Comprehension [6.442209435258797]
LegalRelectra is a legal-domain language model trained on mixed-domain legal and medical corpora.
Our training architecture implements the Electra framework, but utilizes Reformer instead of BERT for its generator and discriminator.
arXiv Detail & Related papers (2022-12-16T00:15:14Z)
- AraLegal-BERT: A pretrained language model for Arabic Legal text [0.399013650624183]
We introduce AraLegal-BERT, a bidirectional encoder Transformer-based model that has been thoroughly tested and carefully optimized.
We fine-tuned AraLegal-BERT and evaluated it against three BERT variants for Arabic on three natural language understanding (NLU) tasks.
The results show that the base version of AraLegal-BERT achieves better accuracy than the general and original BERT models on legal text.
arXiv Detail & Related papers (2022-10-15T13:08:40Z)
- MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model [17.566140528671134]
We show that a single multilingual domain-specific model can outperform the general multilingual model.
We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual.
arXiv Detail & Related papers (2021-09-14T11:50:26Z)
- Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL).
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
- Comparing the Performance of NLP Toolkits and Evaluation measures in Legal Tech [0.0]
We compare and analyze the pretrained neural language models XLNet (autoregressive) and BERT (autoencoder) on legal tasks.
The XLNet model performs better on our sequence classification task of legal opinion classification, whereas BERT produces better results on the NER task.
We use domain-specific pretraining and additional legal vocabulary to adapt the BERT model further to the legal domain (a sketch of the vocabulary-extension step appears after this list).
arXiv Detail & Related papers (2021-03-12T11:06:32Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT to learn from both in-domain corpus and relevant domain corpora.
Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)
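The "Prompting Encoder Models for Zero-Shot Classification" entry above describes prompting encoder-only LMs directly, without fine-tuning. As a rough illustration of the general idea only (not the cited paper's prompts, models, or datasets, which target Italian bureaucratic and legal text), the sketch below scores candidate label words at the masked position of a cloze prompt; the checkpoint, prompt wording, and verbaliser words are assumptions.

```python
# Illustrative zero-shot classification with an encoder-only MLM via a cloze
# prompt. Checkpoint, prompt, and verbaliser words are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "camembert-base"        # any encoder MLM; chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

# Candidate labels and the (assumed) French words used to verbalise them.
verbalizers = {"tax": "impôts", "labour": "travail"}
text = "Le contribuable conteste le redressement notifié par l'administration."
prompt = f"{text} Sujet : {tokenizer.mask_token}."

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Score each label by the logit of the first subword of its verbaliser
# (a common approximation when a word splits into several pieces).
def first_subword_id(word):
    return tokenizer(" " + word, add_special_tokens=False).input_ids[0]

scores = {label: logits[first_subword_id(word)].item()
          for label, word in verbalizers.items()}
print(max(scores, key=scores.get))   # predicted label, e.g. "tax"
```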
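The "Comparing the Performance of NLP Toolkits" entry mentions adding legal vocabulary on top of domain-specific pretraining. A minimal sketch of that vocabulary-extension step follows; the checkpoint and term list are placeholders, and the cited paper's actual procedure may differ.

```python
# Illustrative sketch: add domain-specific legal terms to a BERT tokenizer and
# resize the model's embeddings before continued MLM pretraining.
# The checkpoint and the term list are assumptions.
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Legal terms that would otherwise be split into many subword pieces.
legal_terms = ["habeas", "estoppel", "certiorari", "subpoena"]
num_added = tokenizer.add_tokens(legal_terms)

# New embedding rows are appended and initialised randomly; they are then
# learned during continued pretraining (as in the earlier MLM sketch).
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```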
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.