Ditch the Gold Standard: Re-evaluating Conversational Question Answering
- URL: http://arxiv.org/abs/2112.08812v1
- Date: Thu, 16 Dec 2021 11:57:56 GMT
- Title: Ditch the Gold Standard: Re-evaluating Conversational Question Answering
- Authors: Huihan Li, Tianyu Gao, Manan Goenka, Danqi Chen
- Abstract summary: We conduct the first large-scale human evaluation of state-of-the-art CQA systems.
We find that the distribution of human-machine conversations differs drastically from that of human-human conversations.
We propose a question rewriting mechanism based on predicted history, which better correlates with human judgments.
- Score: 9.194536300785481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational question answering (CQA) systems aim to provide
natural-language answers to users in information-seeking conversations.
Existing CQA benchmarks compare models with pre-collected human-human
conversations, using ground-truth answers provided in conversational history.
It remains unclear whether we can rely on this static evaluation for model
development and whether current systems generalize well to real-world
human-machine conversations. In this work, we conduct the first large-scale
human evaluation of state-of-the-art CQA systems, where human evaluators
converse with models and judge the correctness of their answers. We find that
the distribution of human-machine conversations differs drastically from that
of human-human conversations, and there is a disagreement between human and
gold-history evaluation in terms of model ranking. We further investigate how
to improve automatic evaluations, and propose a question rewriting mechanism
based on predicted history, which better correlates with human judgments.
Finally, we discuss the impact of various modeling strategies and future
directions towards better conversational question answering systems.
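The question-rewriting idea is easiest to see in code. Below is a minimal sketch of evaluating a CQA model against its own predicted history, with the rewriter and QA model abstracted as callables; all names and the data layout are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Tuple

def evaluate_with_predicted_history(
    questions: List[str],
    qa_model: Callable[[List[Tuple[str, str]], str], str],
    rewrite_fn: Callable[[List[Tuple[str, str]], str], str],
) -> List[str]:
    """Run a CQA model turn by turn, rewriting each question against the
    model's own previous answers instead of gold answers."""
    predicted_history: List[Tuple[str, str]] = []  # (question, predicted answer)
    answers: List[str] = []
    for question in questions:
        # Resolve pronouns/ellipsis against what the model actually said,
        # so evaluation no longer leaks gold answers into the history.
        standalone = rewrite_fn(predicted_history, question)
        answer = qa_model(predicted_history, standalone)
        predicted_history.append((question, answer))
        answers.append(answer)
    return answers
```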
Related papers
- IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering [10.338962367542331]
We introduce IQA-EVAL, an automatic framework for evaluating interactive question answering.
We also introduce an LLM-based Evaluation Agent (LEA) that can simulate human behaviors to generate interactions with IQA models.
We show that our evaluation framework with GPT-4 as the backbone model achieves a high correlation with human evaluations on the IQA task.
arXiv Detail & Related papers (2024-08-24T10:34:20Z)
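As a rough illustration of the IQA-EVAL setup, the sketch below has an LLM-based agent play the user, converse with an IQA model, and then grade the transcript; `lea_llm`, the prompts, and the turn budget are hypothetical stand-ins, not the framework's actual interface.

```python
from typing import Callable, List

def simulate_and_grade(
    task: str,
    iqa_model: Callable[[List[str]], str],
    lea_llm: Callable[[str], str],
    max_turns: int = 5,
) -> str:
    transcript: List[str] = []
    for _ in range(max_turns):
        # The evaluation agent generates the next user utterance
        # conditioned on the task and the dialogue so far.
        user_turn = lea_llm(
            f"Task: {task}\nDialogue so far:\n" + "\n".join(transcript)
            + "\nAct as the user and write the next question."
        )
        transcript.append(f"User: {user_turn}")
        transcript.append(f"System: {iqa_model(transcript)}")
    # The same agent then judges the whole interaction, mimicking a human rating.
    return lea_llm("Rate this interaction for correctness and helpfulness:\n"
                   + "\n".join(transcript))
```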
- Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems [14.98159964397052]
We analyzed what features an automatic response evaluator needs from the interlocutor's perspective.
The first experiment on the Hazumi dataset revealed that interlocutor awareness plays a critical role in making automatic response evaluation correlate with the interlocutor's judgments.
The second experiment using massive conversations on X (formerly Twitter) confirmed that dialogue continuity prediction can train an interlocutor-aware response evaluator without human feedback.
arXiv Detail & Related papers (2024-01-04T13:15:41Z)
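The continuity-prediction signal described above can be mined without any human labels. Below is a sketch under an assumed alternating two-party thread format; the separator token and data layout are illustrative.

```python
from typing import List, Tuple

def continuity_examples(thread: List[str]) -> List[Tuple[str, str, int]]:
    """Turn an alternating two-party thread into (context, response, label)
    triples for training an interlocutor-aware response evaluator."""
    examples = []
    for i in range(1, len(thread)):
        context = " [SEP] ".join(thread[:i])
        response = thread[i]
        # Positive label if the other party kept the conversation going
        # after this response; no human annotation required.
        label = 1 if i + 1 < len(thread) else 0
        examples.append((context, response, label))
    return examples
```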
- AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models [74.10293412011455]
We propose AutoConv for synthetic conversation generation.
Specifically, we formulate the conversation generation problem as a language modeling task.
We finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process.
arXiv Detail & Related papers (2023-08-12T08:52:40Z)
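A sketch of the language-modeling formulation AutoConv describes: each human conversation is serialized into one training string for LLM finetuning. The role tags and markers below are illustrative choices, not the paper's format.

```python
from typing import List, Tuple

def serialize_conversation(
    background: str, turns: List[Tuple[str, str]]
) -> str:
    """Flatten an information-seeking conversation into language-model
    training text."""
    lines = [f"<background> {background}"]
    for question, answer in turns:
        lines.append(f"<seeker> {question}")
        lines.append(f"<provider> {answer}")
    lines.append("<end-of-conversation>")
    return "\n".join(lines)

# After finetuning on a few such strings, new conversations can be
# synthesized by sampling continuations of "<background> ..." prompts.
```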
- PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z)
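The PLACES recipe amounts to few-shot prompting. Below is a sketch of assembling such a prompt; the template wording and the speaker name ("Alice") are assumptions, not the paper's exact prompt.

```python
from typing import List

def build_places_prompt(expert_examples: List[str], topic: str) -> str:
    """Assemble a few-shot prompt from expert-written conversations, asking
    the model to synthesize a new conversation on a fresh topic."""
    parts = ["The following are friendly conversations between two people."]
    for i, example in enumerate(expert_examples, 1):
        parts.append(f"Example {i}:\n{example}")
    # The trailing speaker tag cues the model to begin the new dialogue.
    parts.append(f"Now write a new conversation about {topic}:\nAlice:")
    return "\n\n".join(parts)
```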
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview [41.789773897391605]
We have developed an intelligent conversational android ERICA.
We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating.
ERICA was evaluated with 40 senior participants, each engaging in a 5-7 minute conversation without a conversational breakdown.
arXiv Detail & Related papers (2021-05-02T06:37:23Z)
- BERT-CoQAC: BERT-based Conversational Question Answering in Context [10.811729691130349]
We introduce a framework based on the publicly available pre-trained language model BERT for incorporating history turns into the system.
Experimental results revealed that our framework is comparable in performance to the state-of-the-art models on the QuAC leaderboard.
arXiv Detail & Related papers (2021-04-23T03:05:17Z)
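The history-modeling idea in BERT-CoQAC can be sketched as input construction for a BERT-style reader. The separator scheme and history window below are assumptions, not the paper's exact configuration.

```python
from typing import List, Tuple

def build_cqa_input(
    history: List[Tuple[str, str]], question: str, passage: str,
    max_history: int = 2,
) -> str:
    """Concatenate recent history turns, the current question, and the
    passage into one sequence for a BERT-style extractive QA model."""
    turns = []
    for past_q, past_a in history[-max_history:]:
        turns.append(f"{past_q} [SEP] {past_a}")
    query = " [SEP] ".join(turns + [question])
    # Standard BERT QA layout: [CLS] query [SEP] passage [SEP]
    return f"[CLS] {query} [SEP] {passage} [SEP]"
```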
- Human-like informative conversations: Better acknowledgements using conditional mutual information [0.0]
This work aims to build a dialogue agent that can weave new factual content into conversations as naturally as humans.
We draw insights from linguistic principles of conversational analysis and annotate human-human conversations from the Switchboard Dialog Act Corpus.
arXiv Detail & Related papers (2021-04-16T00:13:57Z)
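One plausible reading of the conditional mutual information in the title is a pointwise score for candidate acknowledgements: how much more likely the acknowledgement becomes once the new fact enters the context. The sketch below assumes an abstract `log_prob(text, conditioning)` language-model interface and is not the paper's exact formulation.

```python
from typing import Callable

def acknowledgement_cpmi(
    ack: str, fact: str, context: str,
    log_prob: Callable[[str, str], float],
) -> float:
    """pmi(ack; fact | context) = log p(ack | context, fact)
                                 - log p(ack | context)."""
    with_fact = log_prob(ack, f"{context}\n{fact}")
    without_fact = log_prob(ack, context)
    # Higher scores mean the acknowledgement genuinely engages the new fact.
    return with_fact - without_fact
```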
- Towards Data Distillation for End-to-end Spoken Conversational Question Answering [65.124088336738]
We propose a new Spoken Conversational Question Answering (SCQA) task.
SCQA aims to enable QA systems to model complex dialogue flows given speech utterances and text corpora.
Our main objective is to build a QA system that can handle conversational questions in both spoken and text forms.
arXiv Detail & Related papers (2020-10-18T05:53:39Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin when predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
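At inference time, DialogRPT's use reduces to reranking a generator's candidates. Below is a sketch with the feedback-trained scorer abstracted as a callable; the real models are GPT-2-based classifiers, but this interface is an assumption.

```python
from typing import Callable, List

def rank_responses(
    context: str,
    candidates: List[str],
    scorer: Callable[[str, str], float],
) -> List[str]:
    """Sort candidate replies by predicted human-feedback score, so the
    top-ranked reply replaces a pure perplexity-based choice."""
    return sorted(candidates, key=lambda reply: scorer(context, reply),
                  reverse=True)
```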
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.