Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE
- URL: http://arxiv.org/abs/2506.17330v1
- Date: Thu, 19 Jun 2025 03:47:38 GMT
- Title: Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE
- Authors: Simon Thorne
- Abstract summary: Large Language Models (LLMs) have demonstrated significant capabilities across various domains. This study introduces a benchmark framework to evaluate the performance of leading LLMs in executing spreadsheet functions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated significant capabilities across various domains; however, their effectiveness in spreadsheet-related tasks remains underexplored. This study introduces a foundation for a comprehensive benchmark framework to evaluate the performance of leading LLMs in executing spreadsheet functions, formula generation and data manipulation tasks. The benchmark encompasses tasks ranging from basic formula creation to complex, real-world spreadsheet scenarios. Our findings reveal that while LLMs exhibit proficiency in straightforward tasks, they often falter in complex, multi-step operations, frequently producing plausible yet incorrect outputs. These results underscore the limitations of current LLMs in handling spreadsheet tasks that require precise logical reasoning and highlight the need for integrating symbolic reasoning capabilities into LLM architectures. To support this, we introduce FLARE (Formula Logic, Auditing, Reasoning and Evaluation), a new benchmark for evaluating LLM performance on real-world spreadsheet logic, auditing, and reasoning tasks.
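The abstract does not detail how FLARE scores model outputs. As a rough illustration of the kind of harness such a benchmark implies, the sketch below reduces a formula-generation task to prompt/reference pairs and an exact-match check over normalized formulas; SpreadsheetTask, normalize, score, and the mock model are illustrative placeholders under that assumption, not part of FLARE's actual implementation.

```python
# Minimal sketch of a formula-generation benchmark harness (assumed task
# format; not the FLARE schema). The LLM call is stubbed out with a mock.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpreadsheetTask:
    prompt: str    # natural-language task description
    expected: str  # reference formula


def normalize(formula: str) -> str:
    """Crude normalization: drop whitespace and upper-case the formula."""
    return "".join(formula.split()).upper()


def score(tasks: List[SpreadsheetTask], model: Callable[[str], str]) -> float:
    """Exact-match accuracy of model-produced formulas against references."""
    hits = sum(
        normalize(model(t.prompt)) == normalize(t.expected) for t in tasks
    )
    return hits / len(tasks)


if __name__ == "__main__":
    tasks = [
        SpreadsheetTask(
            prompt="Sum the values in cells A1 through A10.",
            expected="=SUM(A1:A10)",
        ),
    ]
    mock_model = lambda prompt: "=sum(A1:A10)"  # stand-in for a real LLM call
    print(f"Exact-match accuracy: {score(tasks, mock_model):.2f}")
```

Exact string matching is only one possible scoring rule; a fuller harness would likely evaluate candidate formulas against a spreadsheet engine so that semantically equivalent formulas are not penalized.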
Related papers
- EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on less than 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - TuRTLe: A Unified Evaluation of LLMs for RTL Generation [0.6010802600885173]
We propose TuRTLe, a unified evaluation framework designed to assess LLMs across key RTL generation tasks. We benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria.
arXiv Detail & Related papers (2025-03-31T07:43:12Z) - Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making. Existing evaluations tend to rely solely on a final success rate. We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [53.56653281752486]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets.
We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps.
We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.