SheetCopilot: Bringing Software Productivity to the Next Level through
Large Language Models
- URL: http://arxiv.org/abs/2305.19308v2
- Date: Mon, 30 Oct 2023 06:36:24 GMT
- Title: SheetCopilot: Bringing Software Productivity to the Next Level through
Large Language Models
- Authors: Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang
- Abstract summary: We propose a SheetCopilot agent that takes a natural language task and controls a spreadsheet to fulfill the requirements.
We curate a representative dataset containing 221 spreadsheet control tasks and establish a fully automated evaluation pipeline.
Our SheetCopilot correctly completes 44.3% of tasks for a single generation, outperforming the strong code generation baseline by a wide margin.
- Score: 60.171444066848856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computer end users have spent billions of hours completing daily tasks like
tabular data processing and project timeline scheduling. Most of these tasks
are repetitive and error-prone, yet most end users lack the skill to automate
this burdensome work. With the advent of large language models (LLMs),
directing software with natural language user requests becomes a reachable goal.
In this work, we propose a SheetCopilot agent that takes a natural language task
and controls a spreadsheet to fulfill the requirements. We propose a set of atomic
actions as an abstraction of spreadsheet software functionalities. We further
design a state machine-based task planning framework for LLMs to robustly
interact with spreadsheets. We curate a representative dataset containing 221
spreadsheet control tasks and establish a fully automated evaluation pipeline
for rigorously benchmarking the ability of LLMs in software control tasks. Our
SheetCopilot correctly completes 44.3% of tasks for a single generation,
outperforming the strong code generation baseline by a wide margin. Our project
page: https://sheetcopilot.github.io/.
Related papers
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks [50.27313829438866]
Plan-Seq-Learn (PSL) is a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control.
PSL achieves success rates of over 85%, out-performing language-based, classical, and end-to-end approaches.
arXiv Detail & Related papers (2024-05-02T17:59:31Z)
- m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks [31.031053149807857]
We introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools.
For each of these task queries, we provide automatically generated plans using this realistic toolset.
We provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable.
arXiv Detail & Related papers (2024-03-17T04:36:18Z)
- SheetAgent: Towards A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models [40.631127096231886]
Large language models (LLMs) have recently been applied to automatic spreadsheet manipulation but have not yet been investigated in realistic tasks.
We introduce SheetRM, a benchmark featuring long-horizon and multi-category tasks with reasoning-dependent manipulation.
We further propose SheetAgent, a novel autonomous agent that utilizes the power of LLMs.
arXiv Detail & Related papers (2024-03-06T11:48:08Z)
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web [43.60736044871539]
We introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate programs.
The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet".
Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks.
arXiv Detail & Related papers (2024-02-27T14:47:53Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- AutoScrum: Automating Project Planning Using Large Language Models [0.0]
Large language models have made it possible to use language models for advanced reasoning.
In this paper we leverage this ability for designing complex project plans based only on knowing the current state and the desired state.
Two approaches are demonstrated - a scrum based approach and a shortcut plan approach.
arXiv Detail & Related papers (2023-06-05T19:16:37Z)
- Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
The Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
The PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.