SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
- URL: http://arxiv.org/abs/2502.06298v1
- Date: Mon, 10 Feb 2025 09:40:25 GMT
- Title: SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
- Authors: Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, Lidong Bing,
- Abstract summary: This study introduces two novel benchmarks, SeaExam and SeaBench, to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios.
Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions.
- Score: 72.93218369941734
- License:
- Abstract: This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
Related papers
- SEA-HELM: Southeast Asian Holistic Evaluation of Language Models [2.119348427296952]
SEA-HELM is a comprehensive and authentic evaluation suite for languages in the Southeast Asian (SEA) region.
It comprises five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety.
SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese.
arXiv Detail & Related papers (2025-02-20T06:32:45Z) - SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages [28.850331326601886]
We introduce Sail, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian languages (SEA)
Sail encompasses three main SEA languages, eight primary tasks including 14 datasets covering three task types (generation, multiple-choice questions, and classification)
arXiv Detail & Related papers (2024-12-02T06:42:51Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z) - Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT [9.682499180341273]
Large language models (LLMs) have significantly advanced text generation, but the human-like quality of their outputs presents major challenges.
We propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in Chinese and English.
This framework supports scalable, reproducible experiments and enables analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance.
arXiv Detail & Related papers (2024-06-13T12:43:40Z) - SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages.
SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning.
Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z) - Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction
Following: A Case Study of Arabic [1.0878040851638]
We employ GPT-4 as a uniform evaluator for both English and Arabic queries to assess and compare the performance of the LLMs on various open-ended tasks.
We find that fine-tuned base models using multilingual and multi-turn datasets could be competitive to models trained from scratch on multilingual data.
arXiv Detail & Related papers (2023-10-23T11:40:04Z) - Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.