How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for
Token-level Evaluation Metrics
- URL: http://arxiv.org/abs/2008.10427v1
- Date: Mon, 24 Aug 2020 13:28:35 GMT
- Title: How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for
Token-level Evaluation Metrics
- Authors: Prasanna Parthasarathi and Joelle Pineau and Sarath Chandar
- Abstract summary: Generative dialogue modeling is widely seen as a language modeling task.
The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user.
The automatic metrics in use evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
- Score: 47.20761880464552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though generative dialogue modeling is widely seen as a language modeling
task, it demands that an agent have a complex natural language understanding of its
input text to carry out a meaningful interaction with a user. The automatic metrics in
use evaluate the quality of the generated text as a proxy for the agent's holistic
interaction, and such metrics have previously been shown not to correlate with human
judgement. In this work, we observe that human evaluation of dialogue agents can be
inconclusive due to the lack of sufficient information for appropriate evaluation.
Automatic metrics are deterministic yet shallow, while human evaluation can be relevant
yet inconclusive. To bridge this gap in evaluation, we propose designing a set of
probing tasks to evaluate dialogue models. The hand-crafted tasks aim to quantitatively
evaluate a generative dialogue model's understanding beyond token-level evaluation of
the generated text. The probing tasks are deterministic like automatic metrics and
require human judgement in their design, benefiting from the best of both worlds. With
experiments on the probe tasks we observe that, unlike RNN-based architectures, the
transformer model may not be learning to comprehend the input text despite its
generated text having higher overlap with the target text.
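
To make the idea of a probe task concrete, below is a minimal sketch (not the authors' exact setup): a dialogue encoder is kept frozen and a lightweight classifier is trained on its utterance representations for an auxiliary label such as dialogue act; the probe's accuracy is then read as a deterministic signal of what the encoder has learned, independent of token overlap with a reference response. The `encode` function, the dialogue-act labels, and the tiny dataset are all illustrative placeholders.

```python
# Minimal probe-task sketch (illustrative; not the authors' exact setup).
# Assumptions: `encode` stands in for any *frozen* dialogue encoder that maps an
# utterance to a fixed-size vector, and the dialogue-act labels come from a
# hypothetical annotated probe set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def encode(utterance: str) -> np.ndarray:
    """Placeholder for a frozen encoder (RNN or transformer) returning a 128-d state."""
    rng = np.random.default_rng(abs(hash(utterance)) % (2**32))
    return rng.standard_normal(128)

# Hypothetical probe data: utterances paired with dialogue-act labels.
train_utts = ["could you book a table for two?", "the weather is lovely today",
              "please reserve a room for friday", "my flight lands at noon"]
train_acts = ["request", "inform", "request", "inform"]
test_utts = ["can you find me a cheap hotel?"]
test_acts = ["request"]

# The encoder stays frozen; only the lightweight probe classifier is trained.
X_train = np.stack([encode(u) for u in train_utts])
X_test = np.stack([encode(u) for u in test_utts])
probe = LogisticRegression(max_iter=1000).fit(X_train, train_acts)

# Probe accuracy is a deterministic signal of what the representations capture,
# independent of token overlap between generated and reference responses.
print("probe accuracy:", accuracy_score(test_acts, probe.predict(X_test)))
```

In an actual evaluation, the placeholder encoder would be replaced by the trained dialogue model's encoder, which is the kind of comparison the abstract draws between RNN-based and transformer models.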
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Story evaluation can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z) - Synthetic Dialogue Dataset Generation using LLM Agents [7.933485970511388]
We develop two agents that "talk" to each other, one acting as the conversational agent, and the other acting as the user.
Using a set of text descriptions of linear problems from NL4Opt available to the user only, the agent and the user engage in conversation until the agent has retrieved all key information from the original problem description.
We conduct human and automatic evaluations, including an evaluation approach that uses GPT-4 to mimic the human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T21:49:30Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z) - What is wrong with you?: Leveraging User Sentiment for Automatic Dialog
Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z) - Do Encoder Representations of Generative Dialogue Models Encode
Sufficient Information about the Task ? [41.36218215755317]
We show that evaluating the generated text through human or automatic metrics is not sufficient to appropriately assess the soundness of a dialogue model's language understanding.
We propose a set of probe tasks to evaluate encoder representation of different language encoders commonly used in dialogue models.
arXiv Detail & Related papers (2021-06-20T04:52:37Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z) - Designing Precise and Robust Dialogue Response Evaluators [35.137244385158034]
We propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained language models.
Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement.
arXiv Detail & Related papers (2020-04-10T04:59:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.