SCALE: Scaling up the Complexity for Advanced Language Model Evaluation
- URL: http://arxiv.org/abs/2306.09237v2
- Date: Fri, 1 Sep 2023 18:00:57 GMT
- Title: SCALE: Scaling up the Complexity for Advanced Language Model Evaluation
- Authors: Vishvaksenan Rasiah, Ronja Stern, Veton Matoshi, Matthias Stürmer,
Ilias Chalkidis, Daniel E. Ho, Joel Niklaus
- Abstract summary: We introduce a novel NLP benchmark that poses challenges to current Large Language Models (LLMs).
Our benchmark comprises diverse legal NLP datasets from the Swiss legal system.
As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference.
- Score: 19.339580164451256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent strides in Large Language Models (LLMs) have saturated many NLP
benchmarks (even professional domain-specific ones), emphasizing the need for
novel, more challenging ones to properly assess LLM capabilities. In this
paper, we introduce a novel NLP benchmark that poses challenges to current LLMs
across four key dimensions: processing long documents (up to 50K tokens),
utilizing domain-specific knowledge (embodied in legal texts), multilingual
understanding (covering five languages), and multitasking (comprising legal
document-to-document Information Retrieval, Court View Generation, Leading
Decision Summarization, Citation Extraction, and eight challenging Text
Classification tasks). Our benchmark comprises diverse legal NLP datasets from
the Swiss legal system, allowing for a comprehensive study of the underlying
non-English, inherently multilingual, federal legal system. Despite recent
advances, efficiently processing long documents for intense review/analysis
tasks remains an open challenge for language models. Also, comprehensive,
domain-specific benchmarks requiring high expertise to develop are rare, as are
multilingual benchmarks. This scarcity underscores our contribution's value,
considering most public models are trained predominantly on English corpora,
while other languages remain understudied, particularly for practical
domain-specific NLP tasks. Our benchmark allows for testing and advancing
state-of-the-art LLMs. As part of our study, we evaluate several pre-trained
multilingual language models on our benchmark to establish strong baselines as
a point of reference. Despite the large size of our datasets (tens to hundreds
of thousands of examples), existing publicly available models struggle with
most tasks, even after in-domain pretraining. We publish all resources
(benchmark suite, pre-trained models, code) under a fully permissive open CC
BY-SA license.
Related papers
- MEL: Legal Spanish Language Model [0.3651422140724638]
This paper presents the development and evaluation of MEL, a legal language model based on XLM-RoBERTa-large.
Evaluation benchmarks show a significant improvement over baseline models in understanding the legal Spanish language.
arXiv Detail & Related papers (2025-01-27T12:50:10Z)
- ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models [0.0]
ArabLegalEval is a benchmark dataset for assessing the Arabic legal knowledge of Large Language Models (LLMs).
Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions.
We aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-08-15T07:09:51Z)
- Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model [1.3812010983144798]
This paper shows that we can leverage a large number of annotation-free legal documents without Chinese word segmentation to fine-tune a large-scale language model.
It can also generate legal document drafts while protecting information privacy and improving information security.
arXiv Detail & Related papers (2024-06-06T16:00:20Z)
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT). We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- Cross-lingual Text Classification Transfer: The Case of Ukrainian [11.508759658889382]
Ukrainian stands as a language that can benefit from the continued refinement of cross-lingual methodologies.
To our knowledge, there is a severe lack of Ukrainian corpora for typical text classification tasks.
In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods.
arXiv Detail & Related papers (2024-04-02T15:37:09Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset [0.0]
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP).
We curated a diverse multilingual legal dataset consisting of over 130'000 annotated sentences in 6 languages.
We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance.
arXiv Detail & Related papers (2023-05-02T05:52:03Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.