Let the LLMs Talk: Simulating Human-to-Human Conversational QA via
Zero-Shot LLM-to-LLM Interactions
- URL: http://arxiv.org/abs/2312.02913v1
- Date: Tue, 5 Dec 2023 17:38:02 GMT
- Title: Let the LLMs Talk: Simulating Human-to-Human Conversational QA via
Zero-Shot LLM-to-LLM Interactions
- Authors: Zahra Abbasiantaeb and Yifei Yuan and Evangelos Kanoulas and Mohammad
Aliannejadi
- Abstract summary: Conversational question-answering systems aim to create interactive search systems that retrieve information by interacting with users.
Existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher).
We propose a simulation framework that employs zero-shot learner LLMs for simulating teacher-student interactions.
- Score: 19.365615476223635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational question-answering (CQA) systems aim to create interactive
search systems that effectively retrieve information by interacting with users.
To replicate human-to-human conversations, existing work uses human annotators
to play the roles of the questioner (student) and the answerer (teacher).
Despite its effectiveness, challenges exist as human annotation is
time-consuming, inconsistent, and not scalable. To address this issue and
investigate the applicability of large language models (LLMs) in CQA
simulation, we propose a simulation framework that employs zero-shot learner
LLMs for simulating teacher-student interactions. Our framework involves two
LLMs interacting on a specific topic, with the first LLM acting as a student,
generating questions to explore a given search topic. The second LLM plays the
role of a teacher by answering questions and is equipped with additional
information, including a text on the given topic. We implement both the student
and teacher by zero-shot prompting the GPT-4 model. To assess the effectiveness
of LLMs in simulating CQA interactions and understand the disparities between
LLM- and human-generated conversations, we evaluate the simulated data from
various perspectives. We begin by evaluating the teacher's performance through
both automatic and human assessment. Next, we evaluate the performance of the
student, analyzing and comparing the disparities between questions generated by
the LLM and those generated by humans. Furthermore, we conduct extensive
analyses to thoroughly examine the LLM performance by benchmarking
state-of-the-art reading comprehension models on both datasets. Our results
reveal that the teacher LLM generates lengthier answers that tend to be more
accurate and complete. The student LLM generates more diverse questions,
covering more aspects of a given topic.
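The framework described above can be sketched as a simple alternating loop between two zero-shot-prompted models. The sketch below is illustrative only: `call_llm` is a placeholder stub standing in for a real chat-completion call (e.g. to GPT-4), and the prompts are assumptions, not the authors' exact prompt wording.

```python
# Minimal sketch of the teacher-student CQA simulation loop.
# `call_llm` is a deterministic stub; swap in a real LLM API client to
# reproduce the paper's setup with zero-shot prompting.

def call_llm(system_prompt: str, history: list[str]) -> str:
    """Placeholder for a zero-shot LLM call; replace with a real API client."""
    turn = len(history)
    if "student" in system_prompt:
        return f"Question {turn // 2 + 1} about the topic?"
    return f"Answer to turn {turn}."

def simulate_cqa(topic: str, passage: str, num_turns: int = 3) -> list[tuple[str, str]]:
    """Simulate a student-teacher conversation on `topic` for `num_turns` turns."""
    # Student sees only the search topic; the teacher is additionally
    # grounded in a text about the topic, as in the paper's framework.
    student_prompt = (f"You are a student exploring the topic: {topic}. "
                      f"Ask one question at a time.")
    teacher_prompt = (f"You are a teacher. Answer questions about {topic} "
                      f"using this text: {passage}")
    history: list[str] = []
    dialogue = []
    for _ in range(num_turns):
        question = call_llm(student_prompt, history)
        history.append(question)
        answer = call_llm(teacher_prompt, history)
        history.append(answer)
        dialogue.append((question, answer))
    return dialogue
```

Both roles share a growing conversation history, so each new student question can build on earlier answers; only the teacher's system prompt carries the grounding passage.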
Related papers
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue [25.89926022671521]
We generate a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset.
We find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along multiple textual properties.
arXiv Detail & Related papers (2024-09-12T18:00:18Z)
- SimulBench: Evaluating Language Models with Creative Simulation Tasks [20.233111652638637]
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios.
A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI.
arXiv Detail & Related papers (2024-09-11T21:53:20Z)
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen to users while generating output and to provide users with instant feedback.
We build a dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z)
- MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic, dialogue-based math dataset for LLM fine-tuning, focusing on improving models' interaction and instruction-following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z)
- Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation [15.718288693929019]
Large Language Models (LLMs) achieve state-of-the-art performance on many NLP tasks.
We study whether LLMs can be used as substitutes for human annotators.
We find that LLMs outperform current automatic measures for system-level evaluation but still struggle to provide satisfactory explanations.
arXiv Detail & Related papers (2024-05-22T15:56:52Z)
- Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk [11.706292228586332]
Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging.
We propose a more effective method for data collection through LLMs engaging in a conversation in various roles.
This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning.
arXiv Detail & Related papers (2024-01-10T09:49:10Z)
- Automated Assessment of Students' Code Comprehension using LLMs [0.3293989832773954]
Large Language Models (LLMs) and encoder-based Semantic Textual Similarity (STS) models are assessed.
Our findings indicate that LLMs, when prompted in few-shot and chain-of-thought settings, perform comparably to fine-tuned encoder-based models in evaluating students' short answers in the programming domain.
arXiv Detail & Related papers (2023-12-19T20:39:12Z)
- Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations [70.7884839812069]
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks.
However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome.
In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue.
arXiv Detail & Related papers (2023-11-09T18:45:16Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.