Related papers: Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

URL: http://arxiv.org/abs/2505.18152v2
Date: Mon, 26 May 2025 17:52:36 GMT
Title: Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
Authors: Wafa Alghallabi, Ritesh Thawkar, Sara Ghaboura, Ketan More, Omkar Thawakar, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer,
Abstract summary: We introduce emphFann or Flop, the first benchmark designed to assess the comprehension of Arabic poetry by large language models.<n>The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context.
Score: 32.247169514152425
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Arabic poetry is one of the richest and most culturally rooted forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce \emph{Fann or Flop}, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in 12 historical eras, covering 14 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator for testing how good the LLM understands classical Arabic through Arabic poetry. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release "Fann or Flop" along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: https://github.com/mbzuai-oryx/FannOrFlop.

Related papers

DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects.<n>We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
Creation of a Numerical Scoring System to Objectively Measure and Compare the Level of Rhetoric in Arabic Texts: A Feasibility Study, and A Working Prototype [0.0]
The aim of this study was to devise a way to measure the density of the literary devices which constitute Arabic rhetoric in a given text.<n>A list of 84 of the commonest literary devices and their definitions was compiled.<n>A method of calculating the density of literary devices based on the morpheme count of the text was utilised.
arXiv Detail & Related papers (2025-07-08T13:01:46Z)
Poem Meter Classification of Recited Arabic Poetry: Integrating High-Resource Systems for a Low-Resource Task [6.562541994801405]
Arabic poetry has received major attention from linguistics over the decades.<n>Identifying meters for a verse is a lengthy and complicated process.<n>We propose a state-of-the-art framework to identify the poem meters of recited Arabic poetry.
arXiv Detail & Related papers (2025-04-16T15:25:45Z)
Arabizi vs LLMs: Can the Genie Understand the Language of Aladdin? [0.4751886527142778]
Arabizi is a hybrid form of Arabic that incorporates Latin characters and numbers.<n>It poses significant challenges for machine translation due to its lack of formal structure.<n>This research project investigates the model's performance in translating Arabizi into both Modern Standard Arabic and English.
arXiv Detail & Related papers (2025-02-28T11:37:52Z)
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.<n>One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.<n>Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.<n>First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving [43.148203559785095]
Large language models (LLMs) with impressive multilingual capabilities may bring a ray of hope to achieve this extreme translation demand.<n>This paper first introduces a suitable benchmark (PoetMT) where each Chinese poetry has a recognized elegant translation.<n>We propose a new metric based on GPT-4 to evaluate the extent to which current LLMs can meet these demands.
arXiv Detail & Related papers (2024-08-19T12:34:31Z)
When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches [7.021140304091526]
This paper introduces a framework called textitAshaar, which encompasses a collection of datasets and pre-trained models designed specifically for the analysis and generation of Arabic poetry. The pipeline established within our proposed approach encompasses various aspects of poetry, such as meter, theme, and era classification. As part of this endeavor, we provide four datasets: one for poetry generation, another for diacritization, and two for Arudi-style prediction.
arXiv Detail & Related papers (2023-07-12T15:07:16Z)
CCPM: A Chinese Classical Poetry Matching Dataset [50.90794811956129]
We propose a novel task to assess a model's semantic understanding of poetry by poem matching. This task requires the model to select one line of Chinese classical poetry among four candidates according to the modern Chinese translation of a line of poetry. To construct this dataset, we first obtain a set of parallel data of Chinese classical poetry and modern Chinese translation.
arXiv Detail & Related papers (2021-06-03T16:49:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.