MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language
Feedback
- URL: http://arxiv.org/abs/2309.10691v3
- Date: Tue, 12 Mar 2024 15:53:06 GMT
- Title: MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language
Feedback
- Authors: Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao
Peng, Heng Ji
- Abstract summary: We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with absolute performance gains of 1-8% for each turn of tool use and 2-17% with natural language feedback.
On the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
- Score: 78.60644407028022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To solve complex tasks, large language models (LLMs) often require multiple
rounds of interactions with the user, sometimes assisted by external tools.
However, current evaluation protocols often emphasize benchmark performance
with single-turn exchanges, neglecting the nuanced interactions among the user,
LLMs, and external tools, while also underestimating the importance of natural
language feedback from users. These oversights contribute to discrepancies
between research benchmark evaluations and real-world use cases. We introduce
MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn
interactions by (1) using tools and (2) leveraging natural language feedback.
To ensure reproducibility, we provide an evaluation framework where LLMs can
access tools by executing Python code and receive users' natural language
feedback simulated by GPT-4. We repurpose a diverse set of established
evaluation datasets focusing on reasoning, coding, and decision-making and
carefully curate them into a compact subset for efficient evaluation. Our
analysis of 20 open- and closed-source LLMs offers intriguing findings. (a)
LLMs generally benefit from tools and language feedback, with performance gains
(absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural
language feedback. (b) Better single-turn performance does not guarantee better
multi-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised
instruction-finetuning (SIFT) and reinforcement learning from human feedback
(RLHF) generally hurt multi-turn capabilities. We expect MINT to help measure
progress and incentivize research in improving LLMs' capabilities in multi-turn
interactions, especially for open-source communities where multi-turn human
evaluation can be less accessible compared to commercial LLMs with a larger
user base.
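The evaluation framework described above implies a simple interaction pattern: at each turn the model either emits Python code to execute as a tool call or proposes a final answer, and a feedback model (GPT-4 in MINT) simulates the user's natural-language critique. Below is a minimal sketch of such a loop, not the authors' implementation; the callables `query_model`, `query_feedback_model`, and `check_answer`, the `max_turns` budget, and the task dictionary keys are assumptions made for illustration.

```python
import io
import contextlib


def execute_python(code: str) -> str:
    """Run model-written code and return captured stdout as the tool observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # fresh namespace; a real harness would sandbox this
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue()


def solve_with_interaction(task, query_model, query_feedback_model,
                           check_answer, max_turns: int = 5) -> bool:
    """Hypothetical multi-turn loop: tool use plus simulated language feedback."""
    history = [("user", task["prompt"])]
    for _ in range(max_turns):
        action = query_model(history)  # assumed to return {"type": ..., "content": ...}
        if action["type"] == "answer":
            if check_answer(action["content"], task["reference"]):
                return True  # solved within the turn budget
            # Wrong answer: a feedback model simulates the user's critique.
            feedback = query_feedback_model(history, action)
            history += [("assistant", action["content"]), ("user", feedback)]
        else:  # "code": execute the proposed Python as a tool call
            observation = execute_python(action["content"])
            history += [("assistant", action["content"]), ("tool", observation)]
    return False  # turn budget exhausted without a correct answer
```

Success rate over a task set can then be computed by running this loop per task at a fixed turn budget, which mirrors the multi-turn setting the benchmark measures.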
Related papers
- Learning to Ask: When LLMs Meet Unclear Instruction [49.256630152684764]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.
We evaluate the tool-use performance of LLMs under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.
We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models [79.32652077838046]
We introduce Parrot, a solution aiming to enhance multi-turn instruction following for large language models (LLMs).
First, we introduce an efficient but effective method for collecting multi-turn instructions that feature human-like queries, such as anaphora and ellipsis.
Second, we propose a context-aware preference optimization strategy to further enhance LLMs for complex queries in multi-turn interaction.
arXiv Detail & Related papers (2023-10-11T08:36:43Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)