BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
- URL: http://arxiv.org/abs/2310.13650v1
- Date: Fri, 20 Oct 2023 16:53:51 GMT
- Title: BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
- Authors: Haodong Duan, Jueqi Wei, Chonghua Wang, Hongwei Liu, Yixiao Fang,
Songyang Zhang, Dahua Lin, Kai Chen
- Abstract summary: This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts.
- Score: 72.65163468440434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interacting with humans via high-quality multi-turn dialogues is a key feature of large language models (LLMs). However, human-based evaluation of such a capability involves intensive manual labor. This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting, through an LLM-based approach. We start from real-world human dialogues and keep the very first utterances as the ChatSEED. We then prompt LLMs to generate a full multi-turn dialogue (tens of utterances) based on the ChatSEED, utterance by utterance. Finally, we adopt state-of-the-art LLMs (GPT-4, etc.) as the judge to evaluate the generated dialogues. Under different evaluation protocols, we come to substantially identical conclusions. We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts. It is difficult for a discriminator to distinguish between GPT-4-generated dialogues and human dialogues. In contrast, other LLMs struggle to generate multi-turn dialogues of satisfactory quality, due to poor instruction-following capability, a tendency to generate lengthy utterances, or limited general capability. All data and code will be provided at https://github.com/open-compass/BotChat/, and we hope they can serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs.
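To make the protocol concrete, the following is a minimal Python sketch of the two stages the abstract describes: extending a ChatSEED utterance by utterance with the model under test (one model plays both speakers), then asking a judge LLM whether the result reads as human. The `chat(model, messages)` helper, the prompts, and the turn count are illustrative assumptions, not the authors' implementation; see the repository linked above for the real code.

```python
# Sketch of the BotChat protocol described in the abstract. `chat` is a
# hypothetical wrapper around any chat-completion API; the prompts are
# paraphrased for illustration and are not the paper's originals.
from typing import Dict, List

def chat(model: str, messages: List[Dict[str, str]]) -> str:
    """Hypothetical chat-completion call; swap in a real client here."""
    raise NotImplementedError

def generate_dialogue(model: str, chat_seed: List[str], n_turns: int = 16) -> List[str]:
    """Extend the ChatSEED utterance by utterance, one model playing both sides."""
    utterances = list(chat_seed)
    system = ("You are one of two people in a casual conversation. "
              "Reply with a single short, natural utterance.")
    for _ in range(n_turns - len(chat_seed)):
        messages = [{"role": "system", "content": system}]
        # Relabel the history each turn so the most recent utterance is
        # always on the "user" side and the model answers as the other speaker.
        offset = len(utterances) % 2
        for i, utt in enumerate(utterances):
            role = "user" if (i + offset) % 2 == 1 else "assistant"
            messages.append({"role": role, "content": utt})
        utterances.append(chat(model, messages))
    return utterances

def judge_dialogue(judge_model: str, utterances: List[str]) -> str:
    """Ask a strong LLM (e.g. GPT-4) whether the dialogue looks human-made."""
    transcript = "\n".join(f"Speaker {i % 2 + 1}: {u}" for i, u in enumerate(utterances))
    prompt = ("Below is a two-party dialogue. Answer 'human' if it could "
              "plausibly be a real human conversation, otherwise 'ai', "
              "and explain briefly.\n\n" + transcript)
    return chat(judge_model, [{"role": "user", "content": prompt}])
```

The abstract's discriminator finding also suggests a pairwise variant of the judging step: show the judge a generated dialogue alongside a real human one and ask which is which.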
Related papers
- Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format in which each participant sends only one message at a time.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction.
Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z)
- Can LLMs Understand the Implication of Emphasized Sentences in Dialogue? [64.72966061510375]
Emphasis is a crucial component of human communication, conveying the speaker's intention and implication beyond the literal text of a dialogue.
This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis.
We evaluate various Large Language Models (LLMs), both open-source and commercial, to measure their performance in understanding emphasis.
arXiv Detail & Related papers (2024-06-16T20:41:44Z)
- Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue [73.69510478736483]
Large language models (LLMs) can generate fluent, coherent, and diverse responses.
However, they lack a crucial ability: communication skills.
This article aims to empower LLMs with communication skills through inner monologues.
Experimental results show that the proposed CSIM strategy improves the backbone models and outperforms the baselines.
arXiv Detail & Related papers (2023-11-13T16:19:42Z)
- DialogBench: Evaluating LLMs as Human-like Dialogue Systems [16.997134341787486]
Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning.
In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks.
We show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems.
arXiv Detail & Related papers (2023-11-03T02:59:56Z)
- A Mixture-of-Expert Approach to RL-based Dialogue Management [56.08449336469477]
We use reinforcement learning to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction.
Most existing RL approaches to DM train the agent at the word level and thus have to deal with a combinatorially complex action space even for a medium-size vocabulary.
We develop an RL-based DM using a novel mixture-of-expert language model (MoE-LM) that consists of (i) a LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a …
arXiv Detail & Related papers (2022-05-31T19:00:41Z)
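Since the last entry outlines a concrete architecture, a hedged sketch may help: the point of the mixture is that the RL policy acts over a handful of expert-proposed utterances rather than over individual words. All names below are invented for illustration and do not come from the paper's code.

```python
# Illustration of the action-space reduction behind an MoE-LM dialogue
# manager: each expert LM proposes a full candidate utterance, and the RL
# policy only chooses among experts, so the action space has size
# len(experts) instead of scaling with the vocabulary.
from typing import Callable, List, Sequence

def moe_dm_step(
    history: List[str],
    experts: Sequence[Callable[[List[str]], str]],   # history -> candidate utterance
    policy: Callable[[List[str], List[str]], int],   # (history, candidates) -> index
) -> str:
    """One dialogue-management step: experts propose, the policy selects."""
    candidates = [expert(history) for expert in experts]
    return candidates[policy(history, candidates)]
```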