LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
- URL: http://arxiv.org/abs/2301.13126v3
- Date: Mon, 8 Jan 2024 10:08:40 GMT
- Title: LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
- Authors: Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Galassi, Matthias Stürmer, Ilias Chalkidis
- Abstract summary: We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME.
The best baseline (XLM-R large) achieves both a dataset aggregate score and a language aggregate score of 61.3.
This indicates that LEXTREME is still very challenging and leaves ample room for improvement.
- Score: 24.54412069999257
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Lately, propelled by the phenomenal advances around the transformer
architecture, the legal NLP field has enjoyed spectacular growth. To measure
progress, well-curated and challenging benchmarks are crucial. However, most
benchmarks are English-only, and in legal NLP specifically there is no
multilingual benchmark available yet. Additionally, many benchmarks are
saturated, with the best models clearly outperforming the best humans and
achieving near-perfect scores. We survey the legal NLP literature and select 11
datasets covering 24 languages, creating LEXTREME. To provide a fair
comparison, we propose two aggregate scores, one based on the datasets and one
on the languages. The best baseline (XLM-R large) achieves both a dataset
aggregate score and a language aggregate score of 61.3. This indicates that
LEXTREME is still very challenging and leaves ample room for improvement. To
make it easy for researchers and practitioners to use, we release LEXTREME on
Hugging Face together with all the code required to evaluate models and a public
Weights and Biases project with all the runs.
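The two aggregate scores mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how scores are averaged, so the use of a harmonic mean here is an assumption, and the dataset names, language codes, and score values are hypothetical placeholders rather than results from the paper.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-dataset, per-language scores (NOT results from the paper).
scores = {
    "dataset_a": {"de": 0.70, "fr": 0.60},
    "dataset_b": {"de": 0.50, "fr": 0.40, "it": 0.30},
}


def dataset_aggregate(scores, average=harmonic_mean):
    """Average within each dataset first, then across datasets.

    The averaging function is an assumption; pass `mean` to compare.
    """
    per_dataset = [average(list(by_lang.values())) for by_lang in scores.values()]
    return average(per_dataset)


def language_aggregate(scores, average=harmonic_mean):
    """Average within each language first, then across languages."""
    by_language = {}
    for by_lang in scores.values():
        for lang, value in by_lang.items():
            by_language.setdefault(lang, []).append(value)
    per_language = [average(values) for values in by_language.values()]
    return average(per_language)


if __name__ == "__main__":
    print("dataset aggregate:", dataset_aggregate(scores))
    print("language aggregate:", language_aggregate(scores))
```

Grouping by dataset keeps a dataset with many languages from dominating the score, while grouping by language does the same for a language that appears in many datasets. The two numbers generally differ when coverage is uneven, as in the toy data above.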
Related papers
- Benchmarking Pre-trained Large Language Models' Potential Across Urdu NLP tasks [0.9786690381850356]
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research.
This study presents an in-depth examination of prominent LLMs, across 14 tasks using 15 Urdu datasets.
Experiments show that SOTA models surpass all the encoder-decoder pre-trained language models in all Urdu NLP tasks with zero-shot learning.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [90.73517523001149]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct.
We propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z)
- IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- KLEJ: Comprehensive Benchmark for Polish Language Understanding [4.702729080310267]
We introduce a comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard.
We also release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks.
arXiv Detail & Related papers (2020-05-01T21:55:40Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.