The Large Language Model GreekLegalRoBERTa
- URL: http://arxiv.org/abs/2410.12852v1
- Date: Thu, 10 Oct 2024 20:54:39 GMT
- Title: The Large Language Model GreekLegalRoBERTa
- Authors: Vasileios Saketos, Despina-Athanasia Pantazi, Manolis Koubarakis
- Abstract summary: We develop four versions of GreekLegalRoBERTa, which are four large language models trained on Greek legal and nonlegal text.
We show that our models surpass the performance of GreekLegalBERT, GreekLegalBERT-v2, and GreekBERT in two tasks involving Greek legal documents.
- Score: 2.4797200957733576
- Abstract: We develop four versions of GreekLegalRoBERTa, which are four large language models trained on Greek legal and nonlegal text. We show that our models surpass the performance of GreekLegalBERT, GreekLegalBERT-v2, and GreekBERT in two tasks involving Greek legal documents: named entity recognition and multi-class legal topic classification. We view our work as a contribution to the study of domain-specific NLP tasks in low-resource languages, like Greek, using modern NLP techniques and methodologies.
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- EUROPA: A Legal Multilingual Keyphrase Generation Dataset [10.365070468192704]
We present EUROPA, a dataset for multilingual keyphrase generation in the legal domain.
It is derived from legal judgments from the Court of Justice of the European Union (EU) and contains instances in all 24 EU official languages.
arXiv Detail & Related papers (2024-03-01T03:30:38Z)
- Exploring Large Language Models for Classical Philology [17.856304057963776]
We create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages.
We evaluate all models on morphological and syntactic tasks, including lemmatization.
Results show that our models provide significant improvements over the SoTA.
arXiv Detail & Related papers (2023-05-23T05:21:02Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z)
- GreekBART: The First Pretrained Greek Sequence-to-Sequence Model [13.429669368275318]
We introduce GreekBART, the first Seq2Seq model based on BART-base architecture and pretrained on a large-scale Greek corpus.
We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks.
arXiv Detail & Related papers (2023-04-03T10:48:51Z)
- Multi-granular Legal Topic Classification on Greek Legislation [4.09134848993518]
We study the task of classifying legal texts written in the Greek language.
This is the first time the task of Greek legal text classification is considered in an open research project.
arXiv Detail & Related papers (2021-09-30T17:43:00Z)
- Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release the Longformer-based pre-trained language model, named as Lawformer, for Chinese legal long documents understanding.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- GREEK-BERT: The Greeks visiting Sesame Street [25.406207104603027]
Transformer-based language models, such as BERT, have achieved state-of-the-art performance in several downstream natural language processing tasks.
We present GREEK-BERT, a monolingual BERT-based language model for modern Greek.
arXiv Detail & Related papers (2020-08-27T09:36:14Z)