BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
- URL: http://arxiv.org/abs/2310.13650v1
- Date: Fri, 20 Oct 2023 16:53:51 GMT
- Title: BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
- Authors: Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang,
Songyang Zhang, Dahua Lin, Kai Chen
- Abstract summary: This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts.
- Score: 72.65163468440434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interacting with humans via high-quality multi-turn dialogues is a key feature of large language models (LLMs). However, human-based evaluation of such a capability involves intensive manual labor. This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting, through an LLM-based approach. We start from real-world human dialogues and keep the very first utterances as the ChatSEED. We then prompt LLMs to generate a full multi-turn dialogue (tens of utterances) based on the ChatSEED, utterance by utterance. Finally, we adopt state-of-the-art LLMs (GPT-4, etc.) as the judge to evaluate the generated dialogues. Under different evaluation protocols, we come to substantially identical conclusions. We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts. It is difficult for a discriminator to distinguish between GPT-4-generated dialogues and human dialogues. In contrast, other LLMs struggle to generate multi-turn dialogues of satisfactory quality, due to poor instruction-following capability, a tendency to generate lengthy utterances, or limited general capability. All data and code will be provided at https://github.com/open-compass/BotChat/, and we hope they can serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs.
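To make the protocol concrete, the following is a minimal Python sketch of the two stages the abstract describes: extending a ChatSEED utterance by utterance with the model under test (one model plays both speakers), then asking a judge LLM whether the result reads as human. The `chat(model, messages)` helper, the prompts, and the turn count are illustrative assumptions, not the authors' implementation; see the repository linked above for the real code.

```python
# Sketch of the BotChat protocol described in the abstract. `chat` is a
# hypothetical wrapper around any chat-completion API; the prompts are
# paraphrased for illustration and are not the paper's originals.
from typing import Dict, List

def chat(model: str, messages: List[Dict[str, str]]) -> str:
    """Hypothetical chat-completion call; swap in a real client here."""
    raise NotImplementedError

def generate_dialogue(model: str, chat_seed: List[str], n_turns: int = 16) -> List[str]:
    """Extend the ChatSEED utterance by utterance, one model playing both sides."""
    utterances = list(chat_seed)
    system = ("You are one of two people in a casual conversation. "
              "Reply with a single short, natural utterance.")
    for _ in range(n_turns - len(chat_seed)):
        messages = [{"role": "system", "content": system}]
        # Relabel the history each turn so the most recent utterance is
        # always on the "user" side and the model answers as the other speaker.
        offset = len(utterances) % 2
        for i, utt in enumerate(utterances):
            role = "user" if (i + offset) % 2 == 1 else "assistant"
            messages.append({"role": role, "content": utt})
        utterances.append(chat(model, messages))
    return utterances

def judge_dialogue(judge_model: str, utterances: List[str]) -> str:
    """Ask a strong LLM (e.g. GPT-4) whether the dialogue looks human-made."""
    transcript = "\n".join(f"Speaker {i % 2 + 1}: {u}" for i, u in enumerate(utterances))
    prompt = ("Below is a two-party dialogue. Answer 'human' if it could "
              "plausibly be a real human conversation, otherwise 'ai', "
              "and explain briefly.\n\n" + transcript)
    return chat(judge_model, [{"role": "user", "content": prompt}])
```

The abstract's discriminator finding also suggests a pairwise variant of the judging step: show the judge a generated dialogue alongside a real human one and ask which is which.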
Related papers
- Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format in which each participant sends only one message at a time.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction.
Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z)
- Can LLMs Understand the Implication of Emphasized Sentences in Dialogue? [64.72966061510375]
Emphasis is a crucial component of human communication, conveying the speaker's intention and implication beyond the literal text of a dialogue.
This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis.
We evaluate various Large Language Models (LLMs), both open-source and commercial, to measure their performance in understanding emphasis.
arXiv Detail & Related papers (2024-06-16T20:41:44Z)
- Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue [73.69510478736483]
Large language models (LLMs) can generate fluent, coherent, and diverse responses.
However, they lack a crucial ability: communication skills.
This article aims to empower LLMs with communication skills through inner monologues.
Experimental results show that the proposed CSIM strategy improves the backbone models and outperforms the baselines.
arXiv Detail & Related papers (2023-11-13T16:19:42Z)
- DialogBench: Evaluating LLMs as Human-like Dialogue Systems [16.997134341787486]
Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning.
In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks.
We show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems.
arXiv Detail & Related papers (2023-11-03T02:59:56Z)
- A Mixture-of-Expert Approach to RL-based Dialogue Management [56.08449336469477]
We use reinforcement learning to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction.
Most existing RL approaches to DM train the agent at the word level and thus have to deal with a combinatorially complex action space even for a medium-size vocabulary.
We develop an RL-based DM using a novel mixture-of-expert language model (MoE-LM) that consists of (i) a LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a …
arXiv Detail & Related papers (2022-05-31T19:00:41Z)
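Since the last entry outlines a concrete architecture, a hedged sketch may help: the point of the mixture is that the RL policy acts over a handful of expert-proposed utterances rather than over individual words. All names below are invented for illustration and do not come from the paper's code.

```python
# Illustration of the action-space reduction behind an MoE-LM dialogue
# manager: each expert LM proposes a full candidate utterance, and the RL
# policy only chooses among experts, so the action space has size
# len(experts) instead of scaling with the vocabulary.
from typing import Callable, List, Sequence

def moe_dm_step(
    history: List[str],
    experts: Sequence[Callable[[List[str]], str]],   # history -> candidate utterance
    policy: Callable[[List[str], List[str]], int],   # (history, candidates) -> index
) -> str:
    """One dialogue-management step: experts propose, the policy selects."""
    candidates = [expert(history) for expert in experts]
    return candidates[policy(history, candidates)]
```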