Chameleon: Plug-and-Play Compositional Reasoning with Large Language
Models
- URL: http://arxiv.org/abs/2304.09842v3
- Date: Tue, 31 Oct 2023 17:43:39 GMT
- Title: Chameleon: Plug-and-Play Compositional Reasoning with Large Language
Models
- Authors: Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying
Nian Wu, Song-Chun Zhu, Jianfeng Gao
- Abstract summary: Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks.
However, they have inherent limitations as they are incapable of accessing up-to-date information.
We present Chameleon, an AI system that augments LLMs with plug-and-play modules for compositional reasoning.
- Score: 187.58051653991686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved remarkable progress in solving
various natural language processing tasks due to emergent reasoning abilities.
However, LLMs have inherent limitations as they are incapable of accessing
up-to-date information (stored on the Web or in task-specific knowledge bases),
using external tools, and performing precise mathematical and logical
reasoning. In this paper, we present Chameleon, an AI system that mitigates
these limitations by augmenting LLMs with plug-and-play modules for
compositional reasoning. Chameleon synthesizes programs by composing various
tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python
functions, and heuristic-based modules) for accomplishing complex reasoning
tasks. At the heart of Chameleon is an LLM-based planner that assembles a
sequence of tools to execute to generate the final response. We showcase the
effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning
tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54%
overall accuracy on ScienceQA, improving the best published few-shot result by
11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0%,
lifting the state of the art to 98.78%. Our analysis also shows that the
GPT-4-powered planner exhibits more consistent and rational tool selection via
inferring potential constraints from instructions, compared to a
ChatGPT-powered planner. The project is available at
https://chameleon-llm.github.io.
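To make the planner-plus-modules design concrete, here is a minimal, hypothetical Python sketch of the control flow described in the abstract: an LLM planner proposes an ordered list of module names, and each module reads from and writes to a shared cache until a final answer is produced. The module names, cache layout, and call_llm() helper are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of an LLM-planned, plug-and-play tool pipeline in the
# spirit of Chameleon. Module names, the cache layout, and call_llm() are
# illustrative assumptions; see https://chameleon-llm.github.io for the
# actual system.
import json


def call_llm(prompt: str) -> str:
    """Placeholder for a call to the planner/solver LLM (e.g., GPT-4)."""
    raise NotImplementedError


# Plug-and-play modules: each reads from and writes to a shared cache.
def image_captioner(cache: dict) -> dict:
    cache["caption"] = "a caption from an off-the-shelf vision model"  # stub
    return cache


def web_search(cache: dict) -> dict:
    cache["evidence"] = "snippets retrieved for: " + cache["question"]  # stub
    return cache


def solution_generator(cache: dict) -> dict:
    cache["answer"] = call_llm(
        f"Question: {cache['question']}\nContext: {cache}\nAnswer:"
    )
    return cache


MODULES = {
    "Image_Captioner": image_captioner,
    "Web_Search": web_search,
    "Solution_Generator": solution_generator,
}


def plan(question: str) -> list:
    """Ask the LLM planner for an ordered tool sequence as a JSON list."""
    prompt = (
        f"Modules: {list(MODULES)}\n"
        f"Question: {question}\n"
        "Return a JSON list of module names to run, in order."
    )
    return json.loads(call_llm(prompt))


def run(question: str) -> str:
    """Execute the planned program module by module."""
    cache = {"question": question}
    for name in plan(question):
        cache = MODULES[name](cache)
    return cache.get("answer", "")
```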
Related papers
- Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks [35.97890508648945]
We introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license.
The model is trained using a multi-task training approach on seven fundamental tasks.
We show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks across seven different evaluation datasets.
arXiv Detail & Related papers (2024-06-27T17:47:26Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs).
It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks.
Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
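The tool-retrieval component described above could, for example, be approximated by scoring a query against short tool descriptions. The sketch below uses an invented toolset and a crude word-overlap score purely for illustration; it is not CRAFT's actual retriever.

```python
# Illustrative sketch of retrieving task-relevant tools from a curated
# toolset by matching a query against tool descriptions. The toolset
# entries and the overlap score are assumptions, not CRAFT's actual method.

TOOLSET = {
    "extract_table_cell": "Return the value of a cell in a markdown table.",
    "solve_linear_equation": "Solve a linear equation given as a string.",
    "lookup_definition": "Look up the definition of a scientific term.",
}


def overlap(query: str, description: str) -> float:
    """Crude relevance score: fraction of query words found in the description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / (len(q) or 1)


def retrieve_tools(query: str, k: int = 2) -> list:
    """Pick the k most relevant tool names to expose in the LLM prompt."""
    ranked = sorted(TOOLSET, key=lambda name: overlap(query, TOOLSET[name]),
                    reverse=True)
    return ranked[:k]


print(retrieve_tools("solve the linear equation in the word problem"))
# -> ['solve_linear_equation', 'extract_table_cell']
```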
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
- AskIt: Unified Programming Interface for Programming with Large Language Models [0.0]
Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks.
This paper introduces AskIt, a domain-specific language specifically designed for LLMs.
Across 50 tasks, AskIt generated concise prompts, achieving a 16.14% reduction in prompt length compared to benchmarks.
arXiv Detail & Related papers (2023-08-29T21:44:27Z)
- TART: A plug-and-play Transformer module for task-agnostic reasoning [38.84903599406189]
Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training.
Traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task.
We propose TART, which generically improves an LLM's reasoning abilities using a synthetically trained Transformer-based reasoning module.
arXiv Detail & Related papers (2023-06-13T04:37:00Z)
- Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents [99.17668730578586]
Pre-trained large language models (LLMs) capture procedural knowledge about the world.
The Plan, Eliminate, and Track (PET) framework translates a task description into a list of high-level sub-tasks.
The PET framework leads to a significant 15% improvement over SOTA for generalization to human goal specifications.
arXiv Detail & Related papers (2023-05-03T20:11:22Z)
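As a rough illustration of the decomposition step in the PET summary above, the snippet below prompts an LLM to rewrite a task description as a list of high-level sub-tasks. The prompt wording and the call_llm() placeholder are assumptions, not the framework's real interface.

```python
# Hypothetical sketch of the decomposition step: prompting an LLM to turn a
# task description into high-level sub-tasks. The prompt wording and the
# call_llm() placeholder are assumptions, not the PET framework's interface.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in a real LLM client here


def decompose(task_description: str) -> list:
    """Ask the LLM for an ordered list of sub-tasks, one per line."""
    prompt = (
        "Break the following task into an ordered list of short, "
        "high-level sub-tasks, one per line:\n" + task_description
    )
    reply = call_llm(prompt)
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


# With a real backend, decompose("put a cooled mug on the shelf") might
# return something like ["find a mug", "cool the mug", "place it on the shelf"].
```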