Meta-evaluation of Conversational Search Evaluation Metrics
- URL: http://arxiv.org/abs/2104.13453v1
- Date: Tue, 27 Apr 2021 20:01:03 GMT
- Title: Meta-evaluation of Conversational Search Evaluation Metrics
- Authors: Zeyang Liu, Ke Zhou and Max L. Wilson
- Abstract summary: We systematically meta-evaluate a variety of conversational search metrics.
We find that METEOR is the best existing single-turn metric considering all three perspectives.
We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search.
- Score: 15.942419892035124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational search systems, such as Google Assistant and Microsoft
Cortana, enable users to interact with search systems in multiple rounds
through natural language dialogues. Evaluating such systems is very challenging
given that any natural language responses could be generated, and users
commonly interact for multiple semantically coherent rounds to accomplish a
search task. Although prior studies proposed many evaluation metrics, the
extent of how those measures effectively capture user preference remains to be
investigated. In this paper, we systematically meta-evaluate a variety of
conversational search metrics. We specifically study three perspectives on
those metrics: (1) reliability: the ability to detect "actual" performance
differences as opposed to those observed by chance; (2) fidelity: the ability
to agree with ultimate user preference; and (3) intuitiveness: the ability to
capture any property deemed important: adequacy, informativeness, and fluency
in the context of conversational search. By conducting experiments on two test
collections, we find that the performance of different metrics varies
significantly across different scenarios, whereas, consistent with prior studies,
existing metrics achieve only a weak correlation with ultimate user preference
and satisfaction. METEOR is, comparatively speaking, the best existing
single-turn metric considering all three perspectives. We also demonstrate that
adapted session-based evaluation metrics can be used to measure multi-turn
conversational search, achieving moderate concordance with user satisfaction.
To our knowledge, our work establishes the most comprehensive meta-evaluation
for conversational search to date.
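To make the fidelity and multi-turn ideas concrete, the sketch below scores a few system responses with METEOR (via NLTK) and checks rank agreement with user preference ratings using Kendall's tau, then aggregates per-turn scores into a simple session-level score. The toy turns, the ratings, and the discounted-average aggregation are illustrative assumptions for this summary, not the test collections or exact protocol of the paper; NLTK and SciPy are assumed to be available.

```python
# Minimal sketch of a fidelity-style check for a single-turn metric (METEOR)
# plus a simple session-level aggregation for multi-turn conversations.
# All data and the discounted-average aggregation are illustrative assumptions,
# not the collections or protocol used in the paper.
# Requires: nltk (with "wordnet" data) and scipy.
import nltk
from nltk.translate.meteor_score import meteor_score
from scipy.stats import kendalltau

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# Hypothetical single-turn data: (reference answer, system response, user rating 1-5).
turns = [
    ("you can renew a passport online or by mail", "renew your passport online or by mail", 5),
    ("the office opens at 9 am on weekdays",       "it opens at nine in the morning",        4),
    ("no, the form cannot be submitted twice",     "yes, submit the form again",             1),
]

metric_scores, user_ratings = [], []
for reference, response, rating in turns:
    # NLTK's meteor_score expects pre-tokenized references and hypothesis.
    metric_scores.append(meteor_score([reference.split()], response.split()))
    user_ratings.append(rating)

# Fidelity-style check: rank agreement between metric scores and user preference.
tau, p_value = kendalltau(metric_scores, user_ratings)
print(f"METEOR per turn: {[round(s, 3) for s in metric_scores]}")
print(f"Kendall's tau vs. user ratings: {tau:.3f} (p={p_value:.3f})")

def session_score(per_turn_scores, discount=0.9):
    """Illustrative session-level aggregation: a discounted average over turns,
    in the spirit of adapting session-based metrics to multi-turn search."""
    weights = [discount ** i for i in range(len(per_turn_scores))]
    return sum(w * s for w, s in zip(weights, per_turn_scores)) / sum(weights)

print(f"Session-level score: {session_score(metric_scores):.3f}")
```

With more turns and more sessions, the same correlation step applied to session-level scores against session-level satisfaction labels would mirror the fidelity analysis described above.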
Related papers
- TaskDiff: A Similarity Metric for Task-Oriented Conversations [6.136198298002772]
We present TaskDiff, a novel conversational similarity metric.
It uses different dialogue components (utterances, intents, and slots) and their distributions to compute similarity.
arXiv Detail & Related papers (2023-10-23T19:03:35Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are, respectively, speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z)
- POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling [25.477834359694473]
Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm where users are allowed, via natural language dialogues, to communicate with search systems.
We propose POSSCORE, a simple yet effective automatic evaluation method for conversational search.
We show that our metrics can correlate with human preference, achieving significant improvements over state-of-the-art baseline metrics.
arXiv Detail & Related papers (2021-09-07T12:31:29Z)
- Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
- Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems [36.73648357051916]
In open-domain dialogue, the overall quality comprises various aspects, such as relevancy, specificity, and empathy.
Existing metrics are not designed to cope with such flexibility.
We propose a simple method to composite metrics of each aspect to obtain a single metric called USL-H.
arXiv Detail & Related papers (2020-11-01T11:34:50Z)
- Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting [56.268862325167575]
We tackle conversational passage retrieval (ConvPR) with query reformulation integrated into a multi-stage ad-hoc IR system.
We propose two conversational query reformulation (CQR) methods: (1) term importance estimation and (2) neural query rewriting.
For the former, we expand conversational queries using important terms extracted from the conversational context with frequency-based signals.
For the latter, we reformulate conversational queries into natural, standalone, human-understandable queries with a pretrained sequence-to-sequence model.
arXiv Detail & Related papers (2020-05-05T14:30:20Z)
- Topic Propagation in Conversational Search [0.0]
In a conversational context, a user expresses her multi-faceted information need as a sequence of natural-language questions.
We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing: (i) topic-aware utterance rewriting, (ii) retrieval of candidate passages for the rewritten utterances, and (iii) neural-based re-ranking of candidate passages.
arXiv Detail & Related papers (2020-04-29T10:06:00Z)
- IART: Intent-aware Response Ranking with Transformers in Information-seeking Conversation Systems [80.0781718687327]
We analyze user intent patterns in information-seeking conversations and propose an intent-aware neural response ranking model, "IART".
IART is built on top of the integration of user intent modeling and language representation learning with the Transformer architecture.
arXiv Detail & Related papers (2020-02-03T05:59:52Z)