SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
- URL: http://arxiv.org/abs/2502.06298v1
- Date: Mon, 10 Feb 2025 09:40:25 GMT
- Title: SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
- Authors: Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, Lidong Bing
- Abstract summary: This study introduces two novel benchmarks, SeaExam and SeaBench, to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions.
- Score: 72.93218369941734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
Related papers
- IberBench: LLM Evaluation on Iberian Languages [2.3034630097498883]
Large Language Models (LLMs) are difficult to evaluate comprehensively, particularly for languages other than English.
We present IberBench, a benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks.
We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations.
arXiv Detail & Related papers (2025-04-23T17:48:25Z) - SEA-LION: Southeast Asian Languages in One Network [16.12423506306059]
We introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages.
The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer.
arXiv Detail & Related papers (2025-04-08T07:24:51Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks.
We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs.
These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z) - SEA-HELM: Southeast Asian Holistic Evaluation of Language Models [2.119348427296952]
SEA-HELM is a comprehensive and authentic evaluation suite for languages in the Southeast Asian (SEA) region.
It comprises five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety.
SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese.
arXiv Detail & Related papers (2025-02-20T06:32:45Z) - SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages [28.850331326601886]
We introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian (SEA) languages.
SailCompass encompasses three main SEA languages and eight primary tasks, including 14 datasets covering three task types (generation, multiple-choice questions, and classification).
arXiv Detail & Related papers (2024-12-02T06:42:51Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks [12.665447518524187]
This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs by comparing them on the same set of multilingual datasets.
Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages.
We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks.
arXiv Detail & Related papers (2023-11-13T16:45:37Z) - Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic [1.0878040851638]
We employ GPT-4 as a uniform evaluator for both English and Arabic queries to assess and compare the performance of the LLMs on various open-ended tasks.
We find that base models fine-tuned on multilingual and multi-turn datasets can be competitive with models trained from scratch on multilingual data.
arXiv Detail & Related papers (2023-10-23T11:40:04Z) - Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT).
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)