AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data
- URL: http://arxiv.org/abs/2507.18442v1
- Date: Thu, 24 Jul 2025 14:26:41 GMT
- Title: AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data
- Authors: Rana Alshaikh, Israa Alghanmi, Shelan Jeawak
- Abstract summary: We present AraTable, a benchmark designed to evaluate the reasoning and understanding capabilities of large language models when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning. We propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.
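To make the proposed self-deliberation evaluation concrete, here is a minimal sketch of how such an automated judge could loop over its own verdict. The abstract does not specify prompts or interfaces, so the `ask_llm` callable, the prompt wording, the `rounds` parameter, and the yes/no protocol are all illustrative assumptions rather than the authors' implementation:

```python
from typing import Callable

def self_deliberation_judge(
    question: str,
    reference_answer: str,
    candidate_answer: str,
    ask_llm: Callable[[str], str],  # hypothetical wrapper around any LLM API
    rounds: int = 2,                # assumed number of deliberation passes
) -> bool:
    """Return True if the judge ultimately accepts the candidate answer."""
    verdict = ask_llm(
        f"Question: {question}\nReference: {reference_answer}\n"
        f"Candidate: {candidate_answer}\n"
        "Is the candidate correct? Answer yes/no with a one-sentence reason."
    )
    # Self-deliberation: the judge critiques and reconsiders its own verdict
    # before committing to a final decision.
    for _ in range(rounds):
        verdict = ask_llm(
            "You previously judged:\n" + verdict + "\n"
            "Re-examine the question, reference, and candidate above. "
            "Correct your verdict if it was wrong, otherwise confirm it. "
            "Answer yes/no with a one-sentence reason."
        )
    return verdict.strip().lower().startswith("yes")

if __name__ == "__main__":
    # Stub judge for a dry run; a real run would call the judging model.
    stub = lambda prompt: "yes - the candidate matches the reference."
    print(self_deliberation_judge("What is the 2023 total?", "145", "145", stub))
```

A real deployment would swap the stub for an API call, and could aggregate verdicts across rounds rather than keeping only the last one.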
Related papers
- TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models [30.26407735827857]
Reasoning with table-structured data poses significant challenges for large language models (LLMs). We present a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT, and ICoT.
arXiv Detail & Related papers (2025-06-23T09:02:04Z) - RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models [9.211266032947497]
We demonstrate that fact retrieval is substantially more difficult than isolated point-wise queries. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy. These findings underscore limitations in current LLMs' ability to synthesize structured factual knowledge.
arXiv Detail & Related papers (2025-05-27T16:33:38Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables [32.9031799179503]
NeedleInATable (NIAT) treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions. Our data, code, and models will be released to facilitate future research.
arXiv Detail & Related papers (2025-04-09T03:46:56Z) - Benchmarking Table Comprehension In The Wild [9.224698222634789]
TableQuest is a new benchmark designed to evaluate the holistic table comprehension capabilities of Large Language Models (LLMs). We experiment with 7 state-of-the-art models and find that, despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations.
arXiv Detail & Related papers (2024-12-13T05:52:37Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Table Question Answering for Low-resourced Indic Languages [71.57359949962678]
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output.
We introduce a fully automatic large-scale tableQA data generation process for low-resource languages with a limited budget.
We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
arXiv Detail & Related papers (2024-10-04T16:26:12Z) - INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [25.402797722575805]
Indic QA Benchmark is a dataset for context-grounded question answering in 11 major Indian languages. Evaluations revealed weak performance in low-resource languages due to a strong English-language bias in their training data. We also investigated the Translate-Test paradigm, where inputs are translated to English for processing and the results are translated back into the source language for output.
arXiv Detail & Related papers (2024-07-18T13:57:16Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis [7.486549276995143]
Large Language Models (LLMs) have been shown to tackle table comprehension tasks without specific training. We probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA). We reveal a strong correlation between perturbation-induced shifts in attention dispersion and drops in performance.
arXiv Detail & Related papers (2024-06-18T15:41:15Z) - Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL [0.0]
We utilize OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework.
Prompts are enriched with a business context document to transform natural language questions (NLQs) into Structured Query Language (SQL) queries; a hedged sketch of such a pipeline appears after this list.
Performance reached a maximum of 85% when high-complexity queries are excluded.
arXiv Detail & Related papers (2024-06-15T17:07:31Z) - From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic. We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)