PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task
Completion
- URL: http://arxiv.org/abs/2311.01767v2
- Date: Tue, 7 Nov 2023 10:13:34 GMT
- Title: PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task
Completion
- Authors: Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Nan Duan
- Abstract summary: We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
- Score: 96.47420221442397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent evaluations of Large Language Models (LLMs) have centered around
testing their zero-shot/few-shot capabilities for basic natural language tasks
and their ability to translate instructions into tool APIs. However, the
evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal
instructions in a complex multi-modal environment has not been investigated. To
address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark
to assess LLMs' ability to create and edit PPT files based on user
instructions. It contains 279 multi-turn sessions covering diverse topics and
hundreds of instructions involving multi-modal operations. We also propose the
PPTX-Match Evaluation System that evaluates if LLMs finish the instruction
based on the prediction file rather than the label API sequence, thus it
supports various LLM-generated API sequences. We measure 3 closed LLMs and 6
open-source LLMs. The results show that GPT-4 outperforms other LLMs with
75.1\% accuracy in single-turn dialogue testing but faces challenges in
completing entire sessions, achieving just 6\% session accuracy. We find three
main error causes in our benchmark: error accumulation in the multi-turn
session, long PPT template processing, and multi-modality perception. These
pose great challenges for future LLM and agent systems. We release the data,
code, and evaluation system of PPTC at \url{https://github.com/gydpku/PPTC}.
Related papers
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization [7.674972936853123]
We investigate whether combining the queries for the same input context in a single prompt to minimize repeated calls can be successfully used in meeting summarization.
We observe that 100% reliability in generating the response in the expected format is usually limited to certain closed-source LLMs.
arXiv Detail & Related papers (2024-02-29T19:00:47Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over
Tabular and Textual Data [77.66158066013924]
We consider harnessing the amazing power of language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances.
To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a benchmark for Fine-grained Constraints Following Benchmark for Large Language Models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - CodeApex: A Bilingual Programming Evaluation Benchmark for Large
Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.