Related papers: FilBench: Can LLMs Understand and Generate Filipino?

FilBench: Can LLMs Understand and Generate Filipino?

URL: http://arxiv.org/abs/2508.03523v1
Date: Tue, 05 Aug 2025 14:48:32 GMT
Title: FilBench: Can LLMs Understand and Generate Filipino?
Authors: Lester James V. Miranda, Elyanah Aco, Conner Manuel, Jan Christian Blaise Cruz, Joseph Marvin Imperial,
Abstract summary: FilBench is a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano.<n>By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs suffer from reading comprehension and translation capabilities.<n>Our work demonstrates the value of curating language-specific benchmarks to aid in driving progress on Filipino NLP.
Score: 2.029906424353094
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs suffer from reading comprehension and translation capabilities. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving only a score of 72.23%. Moreover, we also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing model, SEA-LION v3 70B, achieving only a score of 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to aid in driving progress on Filipino NLP and increasing the inclusion of Philippine languages in LLM development.

Related papers

FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM) [0.873811641236639]
This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language.<n>Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance.<n>The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization.
arXiv Detail & Related papers (2025-05-25T06:36:26Z)
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.<n>Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark.<n>Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons.<n>To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
Batayan: A Filipino NLP benchmark for evaluating Large Language Models [0.0]
Batayan consolidates eight tasks, three of which have not existed prior for Filipino corpora.<n>Our rigorous, native-speaker-driven adaptation and validation processes ensures fluency and authenticity to the complex morphological and syntactic structures of Filipino.<n>We propose principled solutions for building culturally and linguistically-faithful resources in under-represented languages.
arXiv Detail & Related papers (2025-02-19T07:03:15Z)
Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino [8.305146753192858]
We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers.<n>Strong LLM performance in Kalahi indicates a model's ability to generate responses similar to what an average Filipino would say or do in a given situation.
arXiv Detail & Related papers (2024-09-20T15:01:21Z)
Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate.<n>We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations.<n>Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores.
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
D\'olares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English [67.48541936784501]
Tois'on de Oro is the first framework that establishes instruction datasets, finetuned LLMs, and evaluation benchmark for financial LLMs in Spanish joint with English. We construct a rigorously curated bilingual instruction dataset including over 144K Spanish and English samples from 15 datasets covering 7 tasks. We evaluate our model and existing LLMs using FLARE-ES, the first comprehensive bilingual evaluation benchmark with 21 datasets covering 9 tasks.
arXiv Detail & Related papers (2024-02-12T04:50:31Z)
Large Language Models: A Survey [66.39828929831017]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.<n>LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP [17.362068473064717]
Large Language Models (LLMs) have emerged as one of the most important breakthroughs in NLP. This paper introduces BenLLM-Eval, which consists of a comprehensive evaluation of LLMs to benchmark their performance in the Bengali language. Our experimental results demonstrate that while in some Bengali NLP tasks, zero-shot LLMs could achieve performance on par, or even better than current SOTA fine-tuned models.
arXiv Detail & Related papers (2023-09-22T20:29:34Z)
CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
Don't Trust ChatGPT when Your Question is not in English: A Study of Multilingual Abilities and Types of LLMs [16.770697902481107]
Large Language Models (LLMs) have demonstrated exceptional natural language understanding abilities. We propose a systematic way of qualifying the performance disparities of LLMs under multilingual settings. The results show that GPT exhibits highly translating-like behaviour in multilingual settings.
arXiv Detail & Related papers (2023-05-24T02:05:03Z)
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT) This paper systematically investigates the advantages and challenges of LLMs for MMT. We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.