BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism
- URL: http://arxiv.org/abs/2412.17933v2
- Date: Thu, 22 May 2025 08:47:46 GMT
- Title: BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism
- Authors: Martin Fajcik, Martin Docekal, Jan Dolezal, Karel Ondrej, Karel Beneš, Jan Kapsa, Pavel Smrz, Alexander Polok, Michal Hradis, Zuzana Neverilova, Ales Horak, Radoslav Sabol, Michal Stefanik, Adam Jirkovsky, David Adamczyk, Petr Hyner, Jan Hula, Hynek Kydlicek,
- Abstract summary: BenCzechMark (BCM) is the first comprehensive Czech language benchmark designed for large language models. Our benchmark encompasses 50 challenging tasks with corresponding test datasets, primarily in native Czech, 14 of them newly collected. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word.
- Score: 30.267465719961585
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present BenCzechMark (BCM), the first comprehensive Czech language benchmark designed for large language models, offering diverse tasks, multiple task formats, and multiple evaluation metrics. Its duel scoring system is grounded in statistical significance theory and uses aggregation across tasks inspired by social preference theory. Our benchmark encompasses 50 challenging tasks with corresponding test datasets, primarily in native Czech, 14 of them newly collected. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word. Furthermore, we collect and clean BUT-Large Czech Collection, the largest publicly available clean Czech language corpus, and use it for (i) contamination analysis and (ii) continuous pretraining of the first Czech-centric 7B language model with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models. Lastly, we release and maintain a leaderboard with 50 existing model submissions, where new model submissions can be made at https://huggingface.co/spaces/CZLC/BenCzechMark.
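The duel mechanism is only named in the abstract, so the following is a minimal sketch of how duel-based scoring can work, assuming per-example scores for each model on each task: every ordered pair of models "duels" via a paired significance test, and duel wins are aggregated Borda-style into points. The permutation test and the point rule here are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of duel-style scoring (illustrative assumptions:
# a one-sided paired permutation test per duel and a Borda-style
# point count; the paper's exact statistical procedure may differ).
import itertools
import numpy as np

def duel_wins(scores_a, scores_b, n_perm=10_000, alpha=0.05, rng=None):
    """Model A 'wins' the duel if its mean per-example score is
    significantly higher than model B's under a paired permutation test."""
    rng = rng or np.random.default_rng(0)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diff.mean()
    # Randomly flip the signs of the paired differences to build the null.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = (flips * diff).mean(axis=1)
    p_value = (null >= observed).mean()
    return p_value < alpha

def duel_points(per_task_scores):
    """per_task_scores: {task: {model: per-example score array}}.
    Returns one point per duel won, summed over all tasks."""
    points = {m: 0 for scores in per_task_scores.values() for m in scores}
    for model_scores in per_task_scores.values():
        for a, b in itertools.permutations(model_scores, 2):
            if duel_wins(model_scores[a], model_scores[b]):
                points[a] += 1
    return points
```

Models would then be ranked by total points; BCM's actual aggregation draws on social preference theory, so a Borda-style count like this is only a stand-in.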
Related papers
- MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French.
We compare 51 carefully selected embedding models on a large scale.
We find that, although no single model is best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
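As an illustration of what such an evaluation measures, here is a minimal sketch of scoring French sentence pairs with a multilingual embedding model via cosine similarity (the checkpoint and the example sentences are assumptions for illustration, not taken from the paper):

```python
# Minimal sketch of sentence-similarity scoring with an embedding model
# (checkpoint choice is an assumption; any multilingual model would do).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
sentences = [
    "Le chat dort sur le canapé.",             # "The cat sleeps on the sofa."
    "Un chat fait la sieste sur le sofa.",     # a paraphrase of the first
    "La bourse a chuté de trois pour cent.",   # unrelated sentence
]
embeddings = model.encode(sentences, normalize_embeddings=True)
# Pairwise cosine similarities; rows/columns follow the sentence order above.
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```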
arXiv Detail & Related papers (2024-05-30T20:34:37Z)
- DeMuX: Data-efficient Multilingual Learning [57.37123046817781]
DEMUX is a framework that prescribes exact data-points to label from vast amounts of unlabelled multilingual data.
Our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations.
arXiv Detail & Related papers (2023-11-10T20:09:08Z)
- A Dataset and Strong Baselines for Classification of Czech News Texts [0.0]
We present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoder models outperform selected commercially available large-scale generative language models.
arXiv Detail & Related papers (2023-07-20T07:47:08Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset, reaching a Cohen's kappa inter-annotator agreement of 0.83.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving 93.56% accuracy.
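For reference, Cohen's kappa corrects raw agreement for the agreement expected by chance. A minimal sketch of the computation for two annotators on a binary subjective/objective labeling (the toy label arrays are made up for illustration):

```python
# Minimal sketch of Cohen's kappa for two annotators (toy labels).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

a = ["subj", "subj", "obj", "obj", "subj", "obj"]
b = ["subj", "obj", "obj", "obj", "subj", "obj"]
print(round(cohens_kappa(a, b), 3))  # 0.667 for this toy example
```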
arXiv Detail & Related papers (2022-04-29T07:31:46Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan, exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
- Czert -- Czech BERT-like Model for Language Representation [0.0]
This paper describes the training process of the first Czech monolingual language representation models based on BERT and ALBERT architectures.
We pre-train our models on more than 340K sentences, 50 times more than the Czech data included in multilingual models.
arXiv Detail & Related papers (2021-03-24T07:27:28Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
The Word-in-Context dataset (WiC) assesses the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer [2.8273701718153563]
This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data.
We automatically translated SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data.
We then trained and evaluated several BERT and XLM-RoBERTa baseline models.
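A minimal sketch of the zero-shot direction such baselines are compared against: a multilingual QA model fine-tuned on English SQuAD answering a Czech question directly (the checkpoint name is an assumption for illustration; the paper trains its own models on the translated data):

```python
# Minimal sketch of zero-shot cross-lingual QA with a multilingual model
# fine-tuned on English SQuAD data (checkpoint choice is an assumption).
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/xlm-roberta-large-squad2")
result = qa(
    question="Kdo napsal Babičku?",  # "Who wrote The Grandmother?"
    context="Babičku napsala Božena Němcová v roce 1855.",
)
print(result["answer"], result["score"])
```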
arXiv Detail & Related papers (2020-07-03T13:09:37Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.