This is the way: designing and compiling LEPISZCZE, a comprehensive NLP
benchmark for Polish
- URL: http://arxiv.org/abs/2211.13112v1
- Date: Wed, 23 Nov 2022 16:51:09 GMT
- Title: This is the way: designing and compiling LEPISZCZE, a comprehensive NLP
benchmark for Polish
- Authors: Łukasz Augustyniak, Kamil Tagowski, Albert Sawczyn, Denis Janiak,
Roman Bartusiak, Adrian Szymczak, Marcin Wątroba, Arkadiusz Janz, Piotr
Szymański, Mikołaj Morzy, Tomasz Kajdanowicz, Maciej Piasecki
- Abstract summary: We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
- Score: 5.8090623549313944
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The availability of compute and data to train larger and larger language
models increases the demand for robust methods of benchmarking the true
progress of LM training. Recent years witnessed significant progress in
standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or
KILT have become de facto standard tools to compare large language models.
Following the trend to replicate GLUE for other languages, the KLEJ benchmark
has been released for Polish. In this paper, we evaluate the progress in
benchmarking for low-resourced languages. We note that only a handful of
languages have such comprehensive benchmarks. We also note the gap in the
number of tasks being evaluated by benchmarks for resource-rich English/Chinese
and the rest of the world. In this paper, we introduce LEPISZCZE (the Polish
word for glew, the Middle English predecessor of glue), a new, comprehensive
benchmark for Polish NLP with a large variety of tasks and high-quality
operationalization of the benchmark. We design LEPISZCZE with flexibility in
mind. Including new models, datasets, and tasks is as simple as possible while
still offering data versioning and model tracking. In the first run of the
benchmark, we test 13 experiments (task and dataset pairs) based on the five
most recent LMs for Polish. We use five datasets from the Polish benchmark and
add eight novel datasets. As the paper's main contribution, apart from
LEPISZCZE, we provide insights and experiences learned while creating the
benchmark for Polish as the blueprint to design similar benchmarks for other
low-resourced languages.
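In practice, the flexibility described in the abstract amounts to treating every experiment as a (model, dataset) pair behind one uniform entry point, with data versioning and model tracking attached to that entry point. Below is a minimal sketch of such a runner built on the Hugging Face `datasets` and `transformers` libraries; the checkpoint and dataset identifiers are placeholders, this is not LEPISZCZE's actual API, and a real run would fine-tune the classification head before evaluating.

```python
# Hedged sketch of a LEPISZCZE-style experiment runner: one (model, dataset)
# pair evaluated through a uniform interface. All identifiers below are
# placeholders; this is not the benchmark's real API.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/polish-lm"          # placeholder Polish LM checkpoint
DATASET_NAME = "your-org/polish-clf-task"  # placeholder dataset with "text"/"label" columns

def run_experiment(model_name: str, dataset_name: str) -> float:
    """Return test-set accuracy for one (model, dataset) experiment."""
    dataset = load_dataset(dataset_name, split="test")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=dataset.features["label"].num_classes
    )
    model.eval()

    correct = 0
    for example in dataset:
        inputs = tokenizer(example["text"], return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        correct += int(logits.argmax(dim=-1).item() == example["label"])
    return correct / len(dataset)

# Adding a new model or dataset is then just one more pair:
# run_experiment("your-org/another-polish-lm", DATASET_NAME)
```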
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems [2.141587359797428]
It is arduous to compare novel solutions to well-entrenched preprocessing toolkits that rely on rule-based morphological analysers or dictionaries.
Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools.
The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark.
arXiv Detail & Related papers (2024-03-07T14:07:00Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
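The "single line of code" in the title is close to literal when GEM tasks are loaded from the Hugging Face Hub. A minimal sketch, assuming the `GEM/web_nlg` dataset ID and `en` config published under the GEM organization on the Hub; exact IDs may differ across GEM versions.

```python
# One-line access to a GEM task via the Hugging Face Hub; the dataset ID
# "GEM/web_nlg" and config "en" are assumptions based on the GEM Hub namespace.
from datasets import load_dataset

web_nlg = load_dataset("GEM/web_nlg", "en", split="validation")
print(web_nlg[0])  # one example: input triples plus reference texts
```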
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
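Because plT5 uses a single text-to-text objective, every task reduces to feeding in a string and decoding a string. A short sketch, assuming the published `allegro/plt5-base` Hub checkpoint; the task-prefix convention shown here is illustrative, not the paper's exact fine-tuning setup.

```python
# Hedged sketch of the text-to-text interface; "allegro/plt5-base" is the
# published Hub checkpoint, while the prefix convention is an assumption.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("allegro/plt5-base")

# After task-specific fine-tuning, any task becomes string-in, string-out.
inputs = tokenizer("summarize: <polski tekst do streszczenia>", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```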
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- KLUE: Korean Language Understanding Evaluation [43.94952771238633]
We introduce the Korean Language Understanding Evaluation (KLUE) benchmark.
KLUE is a collection of 8 Korean natural language understanding (NLU) tasks.
We build all of the tasks from scratch from diverse source corpora while respecting copyrights.
arXiv Detail & Related papers (2021-05-20T11:40:30Z)
- MOROCCO: Model Resource Comparison Framework [61.444083353087294]
We present MOROCCO, a framework to compare language models compatible with the jiant environment, which supports over 50 NLU tasks.
We demonstrate its applicability for two GLUE-like suites in different languages.
arXiv Detail & Related papers (2021-04-29T13:01:27Z)
- KLEJ: Comprehensive Benchmark for Polish Language Understanding [4.702729080310267]
We introduce a comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard.
We also release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks.
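HerBERT is distributed as a standard Transformer checkpoint, so approaching a KLEJ task starts with loading it like any other encoder. A minimal sketch, assuming the published `allegro/herbert-base-cased` Hub ID; fine-tuning a task head on top of the encoder is the usual route to a KLEJ score.

```python
# Hedged sketch: load HerBERT as a plain encoder and extract contextual
# embeddings; the Hub ID is the published base checkpoint.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

encoded = tokenizer("Ala ma kota.", return_tensors="pt")
hidden_states = model(**encoded).last_hidden_state  # contextual token embeddings
```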
arXiv Detail & Related papers (2020-05-01T21:55:40Z)
- CLUE: A Chinese Language Understanding Evaluation Benchmark [41.86950255312653]
We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark.
CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension.
We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
arXiv Detail & Related papers (2020-04-13T15:02:29Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)