Turn-level Dialog Evaluation with Dialog-level Weak Signals for
Bot-Human Hybrid Customer Service Systems
- URL: http://arxiv.org/abs/2011.06395v1
- Date: Sun, 25 Oct 2020 19:36:23 GMT
- Title: Turn-level Dialog Evaluation with Dialog-level Weak Signals for
Bot-Human Hybrid Customer Service Systems
- Authors: Ruofeng Wen
- Abstract summary: We developed a machine learning approach that quantifies multiple aspects of the success or value of Customer Service contacts at any time during the interaction.
We show how it improves Amazon customer service quality in several applications.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We developed a machine learning approach that quantifies multiple
aspects of the success or value of Customer Service contacts at any time
during the interaction. Specifically, the value/reward function for the
turn-level behaviors across human agents, chatbots and other hybrid dialog
systems is characterized by the incremental information and confidence gain
between sentences, based on the token-level predictions from a multi-task
neural network trained with only weak signals in dialog-level
attributes/states. The resulting model, named Value Profiler, serves as a
goal-oriented dialog manager that enhances conversations by regulating
automated decisions with its reward and state predictions. It supports both
real-time monitoring and scalable offline customer experience evaluation, for
both bot- and human-handled contacts. We show how it improves Amazon customer
service quality in several applications.
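To make the reward construction concrete, here is a minimal Python sketch: a model scores a dialog-level attribute (e.g. contact success) after each turn, and the reward for a turn is the incremental confidence gain over the previous dialog prefix. All names below, including the toy lexical scorer, are illustrative assumptions; the actual Value Profiler derives its rewards from token-level predictions of a multi-task neural network.

```python
# Minimal sketch: turn-level reward as the confidence gain of a
# dialog-level prediction between consecutive dialog prefixes.
from typing import Callable, List

def turn_level_rewards(
    turns: List[str],
    predict_success: Callable[[str], float],  # P(success | dialog prefix)
) -> List[float]:
    rewards = []
    prev_conf = predict_success("")        # prior before any turn
    prefix = ""
    for turn in turns:
        prefix = (prefix + " " + turn).strip()
        conf = predict_success(prefix)     # updated dialog-level prediction
        rewards.append(conf - prev_conf)   # incremental confidence gain
        prev_conf = conf
    return rewards

# Toy stand-in scorer: confidence rises when resolution-like words appear.
def toy_model(prefix: str) -> float:
    hits = sum(w in prefix.lower() for w in ("refund", "resolved", "thanks"))
    return min(0.1 + 0.3 * hits, 1.0)

dialog = ["My order never arrived.",
          "I am sorry, I can issue a refund right away.",
          "Thanks, that resolves it."]
print(turn_level_rewards(dialog, toy_model))  # approx. [0.0, 0.3, 0.3]
```

Under this view, a turn that moves the model's belief toward a successful contact earns positive reward, and the rewards telescope to the final dialog-level confidence, which is what lets dialog-level weak signals supervise turn-level credit.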
Related papers
- Toward More Accurate and Generalizable Evaluation Metrics for
Task-Oriented Dialogs [19.43845920149182]
We introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment.
We argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
arXiv Detail & Related papers (2023-06-06T19:43:29Z)
- Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
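A pipeline in the spirit of DEP can be sketched as follows; the prompt wording, the 1-5 scale, and the placeholder llm_client are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch of prompting an LLM to rate a dialog end to end.
def build_eval_prompt(dialog_turns):
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in dialog_turns)
    return (
        "You are evaluating a chatbot conversation.\n"
        f"{transcript}\n"
        "Rate the bot's overall quality from 1 (poor) to 5 (excellent). "
        "Answer with a single integer."
    )

dialog = [("User", "Where is my package?"),
          ("Bot", "It shipped yesterday and should arrive Friday.")]
prompt = build_eval_prompt(dialog)
# score = int(llm_client.complete(prompt))  # llm_client is a placeholder
```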
arXiv Detail & Related papers (2023-04-11T14:45:01Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation.
Experiments show our model outperforming existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
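The idea shared by this and the previous entry, reading the next user utterance as feedback on the preceding system response, can be sketched as below; the crude lexicon scorer is a toy assumption standing in for the trained sentiment and response predictors these papers use.

```python
import re

POSITIVE = {"thanks", "great", "perfect", "works", "resolved"}
NEGATIVE = {"wrong", "useless", "still", "unhelpful", "not"}

def sentiment(utterance: str) -> float:
    """Crude lexicon sentiment in [-1, 1]; a stand-in for a trained model."""
    words = re.findall(r"[a-z']+", utterance.lower())
    if not words:
        return 0.0
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words)) / len(words)

def score_system_turns(dialog):
    """dialog: list of (speaker, text); each system turn is scored by the
    sentiment of the user utterance that immediately follows it."""
    return [(text, sentiment(dialog[i + 1][1]))
            for i, (speaker, text) in enumerate(dialog[:-1])
            if speaker == "system" and dialog[i + 1][0] == "user"]

dialog = [("user", "My tracking link is broken."),
          ("system", "I have emailed you a new link."),
          ("user", "Great, thanks, that works.")]
print(score_system_turns(dialog))  # [('I have emailed you a new link.', 0.75)]
```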
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- Actionable Conversational Quality Indicators for Improving Task-Oriented Dialog Systems [2.6094079735487994]
This paper introduces and explains the use of Actionable Conversational Quality Indicators (ACQIs).
ACQIs are used both to recognize parts of dialogs that can be improved, and to recommend how to improve them.
We demonstrate the effectiveness of using ACQIs on LivePerson internal dialog systems used in commercial customer service applications.
arXiv Detail & Related papers (2021-09-22T22:41:42Z)
- WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue [17.663449579168297]
We simulate a dialogue in which an agent and a user (the user modelled similarly to the agent, with a supervised learning objective) interact with each other.
The agent uses dynamic blocking to generate ranked, diverse responses and an exploration-exploitation strategy to select among the top-K responses.
Empirical studies on two benchmarks indicate that our model significantly improves response quality and leads to successful conversations.
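The exploration-exploitation step over the top-K responses can be illustrated with a toy epsilon-greedy selector; the epsilon value and the (response, score) interface are assumptions.

```python
import random

def select_response(ranked_candidates, epsilon=0.1):
    """ranked_candidates: (response, score) pairs, best first. With
    probability epsilon explore a random candidate; otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ranked_candidates)[0]  # explore
    return ranked_candidates[0][0]                  # exploit top response

candidates = [("I can refund that for you.", 0.92),
              ("Could you share your order number?", 0.88),
              ("Sorry to hear that!", 0.75)]
print(select_response(candidates))
```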
arXiv Detail & Related papers (2021-08-01T08:00:45Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable not only of performing turn-level evaluation, but also of holistically considering the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations [6.129731338249762]
Current automated methods to estimate turn- and dialogue-level user satisfaction employ hand-crafted features.
We propose a novel user satisfaction estimation approach which minimizes an adaptive multi-task loss function.
The BiLSTM based deep neural net model automatically weighs each turn's contribution towards the estimated dialogue-level rating.
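A minimal PyTorch sketch of such a jointly weighted model follows; the layer sizes, the softmax attention head, and the random features are assumptions, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn

class JointSatisfaction(nn.Module):
    """Predicts a rating per turn plus a dialog rating as a learned
    weighted sum of the turn ratings."""
    def __init__(self, turn_dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(turn_dim, hidden,
                               bidirectional=True, batch_first=True)
        self.turn_head = nn.Linear(2 * hidden, 1)  # per-turn rating
        self.attn = nn.Linear(2 * hidden, 1)       # per-turn weight logit

    def forward(self, turn_feats):                 # (batch, turns, turn_dim)
        enc, _ = self.encoder(turn_feats)          # (batch, turns, 2*hidden)
        turn_scores = self.turn_head(enc).squeeze(-1)
        weights = torch.softmax(self.attn(enc).squeeze(-1), dim=-1)
        dialog_score = (weights * turn_scores).sum(dim=-1)
        return turn_scores, dialog_score

model = JointSatisfaction(turn_dim=32)
turn_scores, dialog_score = model(torch.randn(1, 5, 32))  # 1 dialog, 5 turns
```

A multi-task loss would then combine a turn-level term on turn_scores with a dialog-level term on dialog_score, letting the model weigh the two objectives adaptively.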
arXiv Detail & Related papers (2020-10-06T05:53:13Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
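One way to sketch an unreferenced metric of this kind: embed the dialog context and the candidate response with a pretrained encoder and score their compatibility, needing no gold response at inference. Cosine similarity below is a crude stand-in for the learned scorer in the paper, and the encoder name is an assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def unreferenced_score(context: str, response: str) -> float:
    """Score a response against its context with no reference response."""
    c, r = encoder.encode([context, response])
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)))

print(unreferenced_score("Where is my order?",
                         "It shipped yesterday and should arrive Friday."))
```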
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)