SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
- URL: http://arxiv.org/abs/2502.14301v1
- Date: Thu, 20 Feb 2025 06:32:45 GMT
- Title: SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
- Authors: Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, William Chandra Tjhi
- Abstract summary: SEA-HELM is a comprehensive and authentic evaluation suite for languages in the Southeast Asian (SEA) region.
It comprises five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety.
SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese.
- Score: 2.119348427296952
- License:
- Abstract: With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multicultural benchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and authentic evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasizes SEA languages, comprising five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models' multilingual and multicultural performance in a systematic and user-friendly manner.
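As a rough illustration of the structure the abstract describes, the sketch below models the five pillars and five supported languages as a simple data structure and averages a model's per-language scores across pillars. The field names, scores, and aggregation scheme are hypothetical and are not taken from the SEA-HELM codebase or leaderboard.

```python
# Minimal sketch (hypothetical): representing SEA-HELM's five pillars and
# five supported languages, and aggregating per-language scores.
# The scores and the averaging scheme below are illustrative only and do not
# reflect the actual SEA-HELM implementation or its leaderboard metric.
from statistics import mean

PILLARS = ["NLP Classics", "LLM-specifics", "SEA Linguistics", "SEA Culture", "Safety"]
LANGUAGES = ["Filipino", "Indonesian", "Tamil", "Thai", "Vietnamese"]

def aggregate_by_language(scores: dict) -> dict:
    """Average a model's pillar scores for each language (illustrative aggregation)."""
    return {
        lang: mean(scores[lang][pillar] for pillar in PILLARS if pillar in scores[lang])
        for lang in LANGUAGES
        if lang in scores
    }

# Example usage with made-up numbers for a single model:
example_scores = {
    "Thai": {"NLP Classics": 0.62, "LLM-specifics": 0.55, "SEA Linguistics": 0.48,
             "SEA Culture": 0.51, "Safety": 0.70},
}
print(aggregate_by_language(example_scores))  # {'Thai': 0.572}
```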
Related papers
- Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs [61.117234373024665]
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes.
Sailor2 undergoes continual pre-training on 500B tokens to support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages.
arXiv Detail & Related papers (2025-02-18T16:04:57Z) - SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia [72.93218369941734]
This study introduces two novel benchmarks, SeaExam and SeaBench, to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios.
Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions.
arXiv Detail & Related papers (2025-02-10T09:40:25Z) - SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages [28.850331326601886]
We introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian (SEA) languages.
SailCompass encompasses three main SEA languages and eight primary tasks, comprising 14 datasets that cover three task types (generation, multiple-choice questions, and classification).
arXiv Detail & Related papers (2024-12-02T06:42:51Z) - All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages.
The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z) - Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus [0.9674145073701153]
We introduce Nemotron-Mini-Hindi 4B, a bilingual small language model (SLM) supporting both Hindi and English.
We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks.
arXiv Detail & Related papers (2024-10-18T18:35:19Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models [0.06597195879147556]
BHASA is a holistic linguistic and cultural evaluation suite for Large Language Models (LLMs) in Southeast Asian languages.
It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Natural Language Generation (NLG), and Natural Language Reasoning (NLR); (2) LINDSEA, a linguistic diagnostic toolkit that spans linguistic phenomena including syntax, semantics, and pragmatics; and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity.
arXiv Detail & Related papers (2023-09-12T09:31:25Z)