TaskLAMA: Probing the Complex Task Understanding of Language Models
- URL: http://arxiv.org/abs/2308.15299v1
- Date: Tue, 29 Aug 2023 13:36:45 GMT
- Title: TaskLAMA: Probing the Complex Task Understanding of Language Models
- Authors: Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite,
Deepak Ramachandran
- Abstract summary: Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task into a directed acyclic graph over individual steps that contribute to achieving the task.
We probe how accurately SCTD can be done with the knowledge extracted from Large Language Models (LLMs).
Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline.
- Score: 13.336015994186955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured Complex Task Decomposition (SCTD) is the problem of breaking down
a complex real-world task (such as planning a wedding) into a directed acyclic
graph over individual steps that contribute to achieving the task, with edges
specifying temporal dependencies between them. SCTD is an important component
of assistive planning tools, and a challenge for commonsense reasoning systems.
We probe how accurately SCTD can be done with the knowledge extracted from
Large Language Models (LLMs). We introduce a high-quality human-annotated
dataset for this problem and novel metrics to fairly assess performance of LLMs
against several baselines. Our experiments reveal that LLMs are able to
decompose complex tasks into individual steps effectively, with a relative
improvement of 15% to 280% over the best baseline. We also propose a number of
approaches to further improve their performance, with a relative improvement of
7% to 37% over the base model. However, we find that LLMs still struggle to
predict pairwise temporal dependencies, which reveals a gap in their
understanding of complex tasks.
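To make the target representation concrete, the sketch below encodes a toy decomposition as a DAG: each step maps to the set of steps that must temporally precede it, and a topological sort doubles as an acyclicity check. The step names, and the edge-level precision/recall at the end, are illustrative assumptions, not the TaskLAMA dataset format or the paper's actual metrics.

```python
# A toy SCTD-style decomposition as a DAG (all step names invented for
# illustration; this is not the TaskLAMA dataset format).
from graphlib import TopologicalSorter, CycleError  # Python 3.9+ stdlib

# Map each step to the set of steps that must temporally precede it,
# i.e., its incoming temporal-dependency edges.
plan = {
    "set a budget": set(),
    "book a venue": {"set a budget"},
    "hire a caterer": {"set a budget", "book a venue"},
    "send invitations": {"book a venue"},
}

def check_and_order(graph):
    """Return one step ordering consistent with the edges; raises
    CycleError if the supposed DAG actually contains a cycle."""
    return list(TopologicalSorter(graph).static_order())

def edges(graph):
    """Flatten the predecessor map into (before, after) pairs."""
    return {(pre, step) for step, pres in graph.items() for pre in pres}

def edge_precision_recall(predicted, gold):
    """Edge-level precision/recall between two decompositions -- a
    generic stand-in for scoring pairwise temporal dependencies, not
    the paper's evaluation metrics."""
    p, g = edges(predicted), edges(gold)
    tp = len(p & g)
    return (tp / len(p) if p else 0.0), (tp / len(g) if g else 0.0)

print(check_and_order(plan))
```

With this encoding, the step-prediction and dependency-prediction results in the abstract roughly correspond to scoring the node set and the edge set of the predicted graph, respectively.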
Related papers
- ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [39.606908488885125]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs).
It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities.
Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z)
- Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting [22.025533583703126]
We propose a novel Prompt Recursive Search (PRS) framework for large language models (LLMs).
The PRS framework incorporates an assessment of problem complexity and an adjustable structure, ensuring a reduction in the likelihood of errors.
Compared to the Chain of Thought (CoT) method, PRS increases accuracy on the BBH dataset by 8% absolute with the Llama3-7B model, a 22% relative improvement.
arXiv Detail & Related papers (2024-08-02T17:59:42Z)
- Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs).
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to that of large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- ADaPT: As-Needed Decomposition and Planning with Language Models [131.063805299796]
We introduce As-Needed Decomposition and Planning for complex Tasks (ADaPT).
ADaPT explicitly plans and decomposes complex sub-tasks as needed, i.e., when the Large Language Model is unable to execute them.
Our results demonstrate that ADaPT substantially outperforms established strong baselines.
arXiv Detail & Related papers (2023-11-08T17:59:15Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Learning to Perform Complex Tasks through Compositional Fine-Tuning of Language Models [20.173322408302134]
Compositional fine-tuning (CFT) is an approach based on explicitly decomposing a target task into component tasks.
We show that CFT outperforms end-to-end learning even with equal amounts of data.
arXiv Detail & Related papers (2022-10-23T03:22:34Z)