PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion
- URL: http://arxiv.org/abs/2403.03788v1
- Date: Wed, 6 Mar 2024 15:33:32 GMT
- Title: PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion
- Authors: Zekai Zhang, Yiduo Guo, Yaobo Liang, Dongyan Zhao, Nan Duan
- Abstract summary: We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
- Score: 96.47420221442397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing dependence on Large Language Models (LLMs) for finishing user
instructions necessitates a comprehensive understanding of their robustness to
complex task completion in real-world situations. To address this critical
need, we propose the PowerPoint Task Completion Robustness benchmark (PPTC-R)
to measure LLMs' robustness to user PPT task instructions and software
versions. Specifically, we construct adversarial user instructions by attacking
them at the sentence, semantic, and multi-language levels. To assess the
robustness of LLMs to software versions, we vary the number of provided APIs to
simulate both the newest-version and earlier-version settings.
Subsequently, we test 3 closed-source and 4 open-source LLMs using a benchmark
that incorporates these robustness settings, aiming to evaluate how deviations
impact LLMs' API calls for task completion. We find that GPT-4 exhibits the
highest performance and strong robustness in our benchmark, particularly in the
version update and the multilingual settings. However, we find that all LLMs
lose their robustness when confronted with multiple challenges (e.g.,
multi-turn) simultaneously, leading to significant performance drops. We
further analyze the robustness behavior and error reasons of LLMs in our
benchmark, which provide valuable insights for researchers to understand the
LLM's robustness in task completion and develop more robust LLMs and agents. We
release the code and data at \url{https://github.com/ZekaiGalaxy/PPTCR}.
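The two robustness axes described above (perturbing the user instruction and varying the visible API set to mimic software-version changes) can be illustrated with a short sketch. This is a minimal illustrative example in Python, not the authors' released code; the API names, the concrete perturbations, and the prompt format are all assumptions.

```python
# Minimal illustrative sketch (not the authors' released code) of the two
# robustness axes: perturbing a user instruction and varying the visible API
# set to mimic software-version changes. All names/perturbations are assumed.
import random

BASE_APIS = ["create_slide", "insert_text", "insert_image", "set_font_size"]
NEW_APIS = BASE_APIS + ["insert_chart", "add_animation"]  # "newest version"
OLD_APIS = BASE_APIS[:3]                                   # "earlier version"

def sentence_level_attack(instruction: str) -> str:
    """Append an irrelevant sentence so the model must ignore noise."""
    return f"{instruction} By the way, the weather is nice today."

def semantic_level_attack(instruction: str) -> str:
    """Rephrase the instruction while keeping its meaning (toy paraphrase)."""
    return instruction.replace("Add", "Please insert")

def multilingual_attack(instruction: str, translations: dict) -> str:
    """Swap in a pre-translated version of the instruction, if available."""
    return translations.get(instruction, instruction)

def build_prompt(instruction: str, api_names: list) -> str:
    """Compose the prompt an LLM would see: available APIs + the instruction."""
    api_block = "\n".join(f"- {name}" for name in api_names)
    return f"Available APIs:\n{api_block}\n\nInstruction: {instruction}"

if __name__ == "__main__":
    original = "Add a title slide that says 'Quarterly Report'."
    attacked = random.choice([sentence_level_attack, semantic_level_attack])(original)
    # Evaluate the same attacked instruction under both version settings.
    for apis in (NEW_APIS, OLD_APIS):
        print(build_prompt(attacked, apis))
        print("---")
```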
Related papers
- SimulBench: Evaluating Language Models with Creative Simulation Tasks [20.233111652638637]
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios.
A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI.
arXiv Detail & Related papers (2024-09-11T21:53:20Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve this task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances.
To improve the robustness of the analysis, we propose evaluating LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
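A file-based check in the spirit of the PPTX-Match idea above might look like the following sketch, which compares text content extracted from the predicted .pptx against the reference .pptx rather than comparing API-call sequences. It is only an assumption about how such a match could be implemented (using the python-pptx package), not the authors' evaluation system.

```python
# Hedged sketch of a prediction-file-based check: compare the text content of
# the predicted deck against the reference deck, instead of comparing the
# predicted API-call sequence to a label sequence. Illustrative only.
from pptx import Presentation  # pip install python-pptx

def slide_texts(path: str) -> list:
    """Extract the set of text strings on each slide of a .pptx file."""
    texts = []
    for slide in Presentation(path).slides:
        strings = set()
        for shape in slide.shapes:
            if shape.has_text_frame:
                strings.add(shape.text_frame.text.strip())
        texts.append(strings)
    return texts

def decks_match(pred_path: str, label_path: str) -> bool:
    """Pass if every predicted slide's text matches the reference slide's text."""
    pred, label = slide_texts(pred_path), slide_texts(label_path)
    return len(pred) == len(label) and all(p == l for p, l in zip(pred, label))

# Example: decks_match("prediction.pptx", "label.pptx")
```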
arXiv Detail & Related papers (2023-11-03T08:06:35Z)
- FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.