Evaluating Large Language Models for Document-grounded Response
Generation in Information-Seeking Dialogues
- URL: http://arxiv.org/abs/2309.11838v1
- Date: Thu, 21 Sep 2023 07:28:03 GMT
- Title: Evaluating Large Language Models for Document-grounded Response
Generation in Information-Seeking Dialogues
- Authors: Norbert Braunschweiler and Rama Doddipatla and Simon Keizer and
Svetlana Stoyanchev
- Abstract summary: We investigate the use of large language models (LLMs) like ChatGPT for document-grounded response generation in the context of information-seeking dialogues.
For evaluation, we use the MultiDoc2Dial corpus of task-oriented dialogues in four social service domains.
While both ChatGPT variants are more likely to include information not present in the relevant segments, possibly indicating the presence of hallucinations, they are rated higher than both the shared task winning system and human responses.
- Score: 17.41334279810008
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we investigate the use of large language models (LLMs) like
ChatGPT for document-grounded response generation in the context of
information-seeking dialogues. For evaluation, we use the MultiDoc2Dial corpus
of task-oriented dialogues in four social service domains previously used in
the DialDoc 2022 Shared Task. Information-seeking dialogue turns are grounded
in multiple documents providing relevant information. We generate dialogue
completion responses by prompting a ChatGPT model, using two methods:
ChatCompletion and LlamaIndex. ChatCompletion uses knowledge from ChatGPT
model pretraining, while LlamaIndex also extracts relevant information from
the documents. Observing that document-grounded response generation via LLMs
cannot be adequately assessed by automatic evaluation metrics, since the
LLM-generated responses are significantly more verbose, we perform a human
evaluation in which annotators rate the output of the shared task winning
system, the outputs of the two ChatGPT variants, and the human responses.
While both ChatGPT variants are more likely to include information not
present in the relevant segments, possibly indicating the presence of
hallucinations, they are rated higher than both the shared task winning
system and the human responses.
Related papers
- Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries [48.243879779374836]
Few-shot dialogue state tracking (DST) with Large Language Models (LLMs) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning.
Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance.
We handle the task of conversation retrieval based on text summaries of the conversations.
An LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search.
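The retrieval step described above reduces to maximum inner product search over embedded summaries; a minimal, self-contained sketch, with random vectors standing in for real LLM-summary embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 384))  # stand-ins for embedded summaries of stored dialogues
query = rng.normal(size=384)         # stand-in for the embedded summary of the test dialogue

scores = keys @ query                  # inner product of the query with every key
top_k = np.argsort(scores)[-3:][::-1]  # indices of the 3 most similar in-context examples
print(top_k)
```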
arXiv Detail & Related papers (2024-02-20T14:31:17Z)
- Evaluating Large Language Models in Semantic Parsing for Conversational Question Answering over Knowledge Graphs [6.869834883252353]
This paper evaluates the performance of large language models that have not been explicitly pre-trained on this task.
Our results demonstrate that large language models are capable of generating graph queries from dialogues.
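A hypothetical sketch of that generation step (the prompt wording, model, and target query language are assumptions; the paper's setup may differ):

```python
import openai  # pre-1.0 client

dialogue = (
    "user: Who directed Inception?\n"
    "system: Christopher Nolan.\n"
    "user: Which other films did he direct?"
)

reply = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # hypothetical model choice
    messages=[{
        "role": "user",
        "content": "Translate the last user question into a SPARQL query "
                   "over the knowledge graph, resolving pronouns from the "
                   "dialogue context:\n" + dialogue,
    }],
)
print(reply.choices[0].message["content"])  # expected: a SELECT query
```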
arXiv Detail & Related papers (2024-01-03T12:28:33Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
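A hedged two-call sketch of this generate-then-read pattern (prompt wording and model are illustrative, not the paper's):

```python
import openai  # pre-1.0 client

def ask(prompt):
    # Single-turn helper around the ChatCompletion endpoint.
    out = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message["content"]

question = "Who wrote the novel Dracula?"

# Step 1 (generate): the model writes a contextual document for the question.
context = ask(f"Generate a short background document that answers:\n{question}")

# Step 2 (read): the model answers from the generated document alone.
print(ask(f"Document:\n{context}\n\nAnswer using only the document:\n{question}"))
```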
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification: the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
- A Compare Aggregate Transformer for Understanding Document-grounded Dialogue [27.04964963480175]
We propose a Compare Aggregate Transformer (CAT) to jointly denoise the dialogue context and aggregate the document information for response generation.
Experimental results on the CMUDoG dataset show that the proposed CAT model outperforms the state-of-the-art approach and strong baselines.
arXiv Detail & Related papers (2020-10-01T03:44:44Z)
- Detecting and Classifying Malevolent Dialogue Responses: Taxonomy, Data and Methodology [68.8836704199096]
Corpus-based conversational interfaces are able to generate more diverse and natural responses than template-based or retrieval-based agents.
With this increased generative capacity of corpus-based conversational agents comes the need to classify and filter out malevolent responses.
Previous studies on the topic of recognizing and classifying inappropriate content are mostly focused on a certain category of malevolence.
arXiv Detail & Related papers (2020-08-21T22:43:27Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
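One plausible instantiation of such a metric, sketched with off-the-shelf components (the encoder, mean pooling, and cosine scoring are assumptions, not the paper's architecture): score a candidate response against the dialogue context alone, with no gold response needed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool the encoder's last hidden states into one utterance vector.
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

context = "Do you know any good pizza places nearby?"
response = "There's a wood-fired place two blocks away that locals love."

score = torch.cosine_similarity(embed(context), embed(response), dim=0)
print(f"unreferenced quality proxy: {score:.3f}")
```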
arXiv Detail & Related papers (2020-05-01T20:01:39Z)