Related papers: KLEJ: Comprehensive Benchmark for Polish Language Understanding

KLEJ: Comprehensive Benchmark for Polish Language Understanding

URL: http://arxiv.org/abs/2005.00630v1
Date: Fri, 1 May 2020 21:55:40 GMT
Title: KLEJ: Comprehensive Benchmark for Polish Language Understanding
Authors: Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik
Abstract summary: We introduce a comprehensive multi-task benchmark for the Polish language understanding, accompanied by an online leaderboard. We also release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks.
Score: 4.702729080310267
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, a series of Transformer-based models unlocked major improvements in general natural language understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based models.

Related papers

Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis [4.061135251278187]
TrGLUE is a benchmark for evaluating natural language understanding in Turkish.<n>We also present SentiTurca, a specialized benchmark for sentiment analysis.<n> TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations.
arXiv Detail & Related papers (2025-12-26T18:02:09Z)
MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages. We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks [0.0]
We introduce twelve new datasets, creating a new benchmark for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks.<n>The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and General Masked Evaluation Task (GMET)
arXiv Detail & Related papers (2024-11-28T16:32:02Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks. We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French. We compare 51 carefully selected embedding models on a large scale. We find that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
arXiv Detail & Related papers (2024-05-30T20:34:37Z)
DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties. This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English only NLU++ dataset to include manual translations into a range of high, medium, and low resource languages. Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals. We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z)
This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP. We use five datasets from the Polish benchmark and add eight novel datasets. We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z)
Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish [0.0]
The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons. We prepared a new dataset entirely in Polish, which allows comparing the effectiveness of the pre-trained models.
arXiv Detail & Related papers (2022-05-31T12:00:05Z)
Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages [15.373725507698591]
We argue that this makes the existing practices in multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. We propose that the recent work done in Performance Prediction for NLP tasks can serve as a potential solution in fixing benchmarking in Multilingual NLP. We compare performance prediction with translating test data with a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of the performance that are often on-par with the translation based approaches.
arXiv Detail & Related papers (2022-05-12T20:42:48Z)
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.