SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages
        - URL: http://arxiv.org/abs/2412.01186v1
 - Date: Mon, 02 Dec 2024 06:42:51 GMT
 - Title: SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages
 - Authors: Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu
 - Abstract summary: We introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian languages (SEA). SailCompass encompasses three main SEA languages and eight primary tasks, including 14 datasets covering three task types (generation, multiple-choice questions, and classification).
 - Score: 28.850331326601886
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian Languages (SEA). SailCompass encompasses three main SEA languages, eight primary tasks including 14 datasets covering three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibrations to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) A balanced language distribution is important for developing better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public. 
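The abstract names two robustness techniques, perplexity-based ranking of multiple-choice options and calibration, that are easy to illustrate. The sketch below is a minimal, hypothetical example built on Hugging Face transformers, not the SailCompass evaluation scripts; the model name, prompt template, and content-free calibration prompt are placeholder assumptions.

```python
# Minimal sketch of perplexity-based option ranking with a simple
# calibration heuristic for multiple-choice evaluation, in the spirit of the
# techniques named in the abstract. This is NOT the SailCompass code: the
# model, prompt template, and content-free prompt are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM with a compatible tokenizer works
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)
lm.eval()


def option_nll(prompt: str, option: str) -> float:
    """Average negative log-likelihood of `option` tokens given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + option` (true for typical BPE tokenizers when `option` starts
    with a space).
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits  # (1, seq_len, vocab)
    # The logit at position i predicts token i+1, so the option tokens
    # full_ids[:, prompt_len:] are predicted by logits[:, prompt_len-1:-1].
    targets = full_ids[0, prompt_len:]
    log_probs = torch.log_softmax(logits[0, prompt_len - 1:-1], dim=-1)
    token_ll = log_probs[torch.arange(targets.shape[0]), targets]
    return -token_ll.mean().item()


def rank_options(question: str, options: list[str]) -> int:
    """Pick the option with the lowest per-token NLL (lowest perplexity),
    after subtracting its score under a content-free prompt to reduce
    surface-form bias (a simple calibration heuristic)."""
    prompt = f"Question: {question}\nAnswer:"
    null_prompt = "Answer:"  # content-free context used for calibration
    scores = [
        option_nll(prompt, " " + opt) - option_nll(null_prompt, " " + opt)
        for opt in options
    ]
    return min(range(len(options)), key=scores.__getitem__)
```

Scoring options by (calibrated) likelihood avoids parsing free-form generations and makes multiple-choice results less sensitive to surface-form biases; the actual prompt configurations and calibration procedure used by SailCompass are specified in the paper and its released scripts.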
 
       
      
        Related papers
        - mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks [11.996399504336624]
We introduce mSTEB, a new benchmark to evaluate the performance of large language models (LLMs) on a wide range of tasks. We evaluate the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B.
arXiv  Detail & Related papers  (2025-06-10T03:15:08Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv  Detail & Related papers  (2025-05-22T12:27:02Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.
Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv  Detail & Related papers  (2025-04-25T15:39:04Z) - IberBench: LLM Evaluation on Iberian Languages [2.3034630097498883]
Large Language Models (LLMs) are difficult to evaluate comprehensively, particularly for languages other than English.
We present IberBench, a benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks.
We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations.
arXiv  Detail & Related papers  (2025-04-23T17:48:25Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Model (LLM) reasoning tasks.
We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs.
These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv  Detail & Related papers  (2025-03-06T05:15:34Z) - SEA-HELM: Southeast Asian Holistic Evaluation of Language Models [2.119348427296952]
SEA-HELM is a comprehensive and authentic evaluation suite for languages in the Southeast Asian (SEA) region.
It comprises five core pillars: (1) NLP Classics, (2) LLM-specifics, (3) SEA Linguistics, (4) SEA Culture, (5) Safety.
SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese.
arXiv  Detail & Related papers  (2025-02-20T06:32:45Z) - SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia [72.93218369941734]
This study introduces two novel benchmarks, SeaExam and SeaBench, to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios.
Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions.
arXiv  Detail & Related papers  (2025-02-10T09:40:25Z) - ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding [15.93642619347214]
We introduce ProverbEval, an evaluation benchmark for low-resource languages based on proverbs.
We benchmark various LLMs and explore factors that create variability in the benchmarking process.
We argue special attention must be given to the order of choices, choice of prompt language, task variability, and generation tasks.
arXiv  Detail & Related papers  (2024-11-07T06:34:48Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary of the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv  Detail & Related papers  (2024-10-26T00:39:44Z) - Language Imbalance Driven Rewarding for Multilingual Self-improving [35.1576728251478]
Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks, yet their capabilities remain imbalanced across languages. This imbalance, while limiting broader applications, generates a natural preference ranking between languages.
We propose Language Imbalance Driven Rewarding, where the inherent imbalance between dominant and non-dominant languages is leveraged as a reward signal.
arXiv  Detail & Related papers  (2024-10-11T16:32:05Z) - Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on six high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv  Detail & Related papers  (2024-03-04T10:48:13Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv  Detail & Related papers  (2024-02-29T21:05:37Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv  Detail & Related papers  (2023-12-26T18:38:54Z) - AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario-rooted queries, and corresponding human-verified references.
For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge approach (Zheng et al., 2023) with Chain-of-Thought to generate explanations and final ratings.
arXiv  Detail & Related papers  (2023-11-30T17:41:30Z) - Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding abilities, while new tasks can also be easily added to the system.
arXiv  Detail & Related papers  (2023-08-15T17:40:34Z) 
        This list is automatically generated from the titles and abstracts of the papers in this site.