NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark
Dataset for Generative Language Models in Norwegian
- URL: http://arxiv.org/abs/2312.01314v1
- Date: Sun, 3 Dec 2023 08:09:45 GMT
- Title: NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark
Dataset for Generative Language Models in Norwegian
- Authors: Peng Liu, Lemei Zhang, Terje Nissen Farup, Even W. Lauvrak, Jon Espen
Ingvaldsen, Simen Eide, Jon Atle Gulla and Zhirong Yang
- Abstract summary: We introduce NLEBench, a benchmark for evaluating natural language generation capabilities in Norwegian, a low-resource language.
NLEBench encompasses a suite of real-world NLP tasks ranging from news storytelling, summarization, open-domain conversation, natural language understanding, instruction fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought investigation.
This paper also introduces foundational Norwegian Generative Language Models (NorGLMs) developed with diverse parameter scales and Transformer-based architectures.
- Score: 4.236983772147863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Generative Language Models (GLMs) have transformed
Natural Language Processing (NLP) by showcasing the effectiveness of the
"pre-train, prompt, and predict" paradigm in utilizing pre-trained GLM
knowledge for diverse applications. Despite their potential, these capabilities
lack adequate quantitative characterization due to the absence of comprehensive
benchmarks, particularly for low-resource languages. Existing low-resource
benchmarks focus on discriminative language models like BERT, neglecting the
evaluation of generative language models. Moreover, current benchmarks often
overlook measuring generalization performance across multiple tasks, a crucial
metric for GLMs.
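The sketch below is a minimal illustration of the "pre-train, prompt, and predict" paradigm using the Hugging Face `transformers` pipeline. The Norwegian checkpoint id is an assumption for illustration only; the abstract does not name a specific model release.

```python
# A minimal sketch of "pre-train, prompt, and predict" with the Hugging Face
# `transformers` library. The checkpoint id below is a hypothetical example;
# substitute any pretrained Norwegian generative language model.
from transformers import pipeline

# "Pre-train": load a model already pretrained on raw text; no task-specific
# fine-tuning is applied here.
generator = pipeline("text-generation", model="NorGLM/NorGPT-369M")  # hypothetical id

# "Prompt": cast a downstream task (here, news summarization) as plain text.
# Norwegian for: "Summarize the following news article: ... Summary:"
prompt = "Oppsummer følgende nyhetsartikkel:\n<artikkeltekst>\n\nSammendrag:"

# "Predict": the pretrained model completes the prompt zero-shot.
output = generator(prompt, max_new_tokens=100, do_sample=False)
print(output[0]["generated_text"])
```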
To bridge these gaps, we introduce NLEBench, a comprehensive benchmark
tailored for evaluating natural language generation capabilities in Norwegian,
a low-resource language. We use Norwegian as a case study to explore whether
current GLMs and benchmarks in mainstream languages like English can reveal the
unique characteristics of underrepresented languages. NLEBench encompasses a
suite of real-world NLP tasks ranging from news storytelling, summarization,
open-domain conversation, natural language understanding, instruction
fine-tuning, toxicity and bias evaluation, to self-curated Chain-of-Thought
investigation. It features two high-quality, human-annotated datasets: an
instruction dataset covering traditional Norwegian cultures, idioms, slang, and
special expressions, and a document-grounded multi-label dataset for topic
classification, question answering, and summarization. This paper also
introduces foundational Norwegian Generative Language Models (NorGLMs)
developed with diverse parameter scales and Transformer-based architectures.
Systematic evaluations on the proposed benchmark suite provide insights into
the capabilities and scalability of NorGLMs across various downstream tasks.
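To make the document-grounded multi-label dataset concrete, here is a hypothetical sketch of a single record. All field names are assumptions; the abstract states only that each document is annotated for topic classification, question answering, and summarization.

```python
# Hypothetical record layout for the document-grounded multi-label dataset.
# Field names are assumptions; the abstract specifies only that one document
# grounds three tasks: topic classification, question answering, and
# summarization.
record = {
    "document": "Full text of a Norwegian news article ...",
    "topics": ["politikk", "økonomi"],  # multi-label topic annotation
    "qa_pairs": [
        # Norwegian for: "What is the article about?"
        {"question": "Hva handler artikkelen om?", "answer": "..."},
    ],
    "summary": "A human-written summary of the document.",
}
```

Grounding all three tasks in the same documents is presumably one way the benchmark supports measuring generalization across tasks rather than per-task performance in isolation.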
Related papers
- Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, underscoring the urgent need for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on language varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems [2.141587359797428]
It is arduous to compare novel solutions to well-entrenched preprocessing toolkits that rely on rule-based morphological analysers or dictionaries.
Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools.
The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark.
arXiv Detail & Related papers (2024-03-07T14:07:00Z)
- High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models [5.632410663467911]
We explore the extent to which pretrained large language models (LLMs) can bridge the performance gap for under-resourced languages.
We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins.
For all our languages, human evaluation shows that our best systems perform on a par with humans, but BLEU scores collapse compared to English.
arXiv Detail & Related papers (2024-02-19T16:29:40Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models [18.874880342410876]
We present Jamp, a Japanese benchmark focused on temporal inference.
Our dataset includes a range of temporal inference patterns, which enables us to conduct fine-grained analysis.
We evaluate the generalization capacities of monolingual/multilingual LMs by splitting our dataset based on tense fragments.
arXiv Detail & Related papers (2023-06-19T07:00:14Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
- GLGE: A New General Language Generation Evaluation Benchmark [139.25515221280767]
General Language Generation Evaluation (GLGE) is a new multi-task benchmark for evaluating the generalization capabilities of NLG models.
To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines.
arXiv Detail & Related papers (2020-11-24T06:59:45Z)