BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation
Suite for Large Language Models
- URL: http://arxiv.org/abs/2309.06085v2
- Date: Tue, 19 Sep 2023 03:44:17 GMT
- Title: BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation
Suite for Large Language Models
- Authors: Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini
Rengarajan, Kengatharaiyer Sarveswaran, William Chandra Tjhi
- Abstract summary: BHASA is a holistic linguistic and cultural evaluation suite for Large Language Models (LLMs) in Southeast Asian languages.
It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity.
- Score: 0.06597195879147556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid development of Large Language Models (LLMs) and the emergence of
novel abilities with scale have necessitated the construction of holistic,
diverse and challenging benchmarks such as HELM and BIG-bench. However, at the
moment, most of these benchmarks focus only on performance in English and
evaluations that include Southeast Asian (SEA) languages are few in number. We
therefore propose BHASA, a holistic linguistic and cultural evaluation suite
for LLMs in SEA languages. It comprises three components: (1) an NLP benchmark
covering eight tasks across Natural Language Understanding (NLU), Generation
(NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit
that spans the gamut of linguistic phenomena including syntax, semantics and
pragmatics, and (3) a cultural diagnostics dataset that probes for both
cultural representation and sensitivity. For this preliminary effort, we
implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil,
and we only include Indonesian and Tamil for LINDSEA and the cultural
diagnostics dataset. As GPT-4 is purportedly one of the best-performing
multilingual LLMs at the moment, we use it as a yardstick to gauge the
capabilities of LLMs in the context of SEA languages. Our initial experiments
on GPT-4 with BHASA find it lacking in various aspects of linguistic
capabilities, cultural representation and sensitivity in the targeted SEA
languages. BHASA is a work in progress and will continue to be improved and
expanded in the future. The repository for this paper can be found at:
https://github.com/aisingapore/BHASA
Related papers
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages.
The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z)
- MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing Large Language Models in Indic languages.
We introduce MILU, a comprehensive evaluation benchmark designed to address this gap.
With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics.
arXiv Detail & Related papers (2024-11-04T19:17:17Z)
- Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages [55.36534539177367]
This paper introduces Pangea, a multilingual multimodal large language model (MLLM) trained on a diverse 6M instruction dataset spanning 39 languages.
Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts.
We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs.
arXiv Detail & Related papers (2024-10-21T16:19:41Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages [12.514648269553104]
IndicGenBench is the largest benchmark for evaluating the generation capabilities of large language models (LLMs) on Indic languages.
It is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering.
The largest PaLM-2 models perform the best on most tasks; however, there is a significant performance gap in all languages compared to English.
arXiv Detail & Related papers (2024-04-25T17:57:36Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities; yet, they are mostly English-centric due to imbalanced training corpora.
This work extends the evaluation from NLP tasks to real user queries.
For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising.
arXiv Detail & Related papers (2024-03-15T12:47:39Z)
- Teaching Large Language Models an Unseen Language on the Fly [32.83773919852362]
We introduce DiPMT++, a framework for adapting LLMs to unseen languages by in-context learning.
Using a dictionary and 5K parallel sentences only, DiPMT++ significantly enhances the performance of GPT-4 from 0 to 16 BLEU for Chinese-to-Zhuang translation.
We also validate the effectiveness of our framework on Kalamang, another unseen language.
arXiv Detail & Related papers (2024-02-29T13:50:47Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- ParsiNLU: A Suite of Language Understanding Challenges for Persian [23.26176232463948]
This work focuses on the Persian language, one of the widely spoken languages in the world.
There are few NLU datasets available for this rich language.
ParsiNLU is the first benchmark in the Persian language that includes a range of high-level tasks.
arXiv Detail & Related papers (2020-12-11T06:31:42Z)