CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance
- URL: http://arxiv.org/abs/2112.08321v1
- Date: Wed, 15 Dec 2021 18:10:54 GMT
- Title: CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance
- Authors: Hyundong Cho, Chinnadhurai Sankar, Christopher Lin, Kaushik Ram Sadagopan, Shahin Shayandeh, Asli Celikyilmaz, Jonathan May, Ahmad Beirami
- Abstract summary: We design a collection of metrics called CheckDST to test well-known weaknesses with augmented test sets.
We find that span-based classification models are resilient to unseen named entities but not robust to language variety.
Due to their respective weaknesses, neither approach is yet suitable for real-world deployment.
- Score: 18.936466253481363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent neural models that extend the pretrain-then-finetune paradigm continue
to achieve new state-of-the-art results on joint goal accuracy (JGA) for
dialogue state tracking (DST) benchmarks. However, we call into question their
robustness as they show sharp drops in JGA for conversations containing
utterances or dialog flows with realistic perturbations. Inspired by CheckList
(Ribeiro et al., 2020), we design a collection of metrics called CheckDST that
facilitate comparisons of DST models on comprehensive dimensions of robustness
by testing well-known weaknesses with augmented test sets. We evaluate recent
DST models with CheckDST and argue that models should be assessed more
holistically rather than pursuing state-of-the-art on JGA since a higher JGA
does not guarantee better overall robustness. We find that span-based
classification models are resilient to unseen named entities but not robust to
language variety, whereas those based on autoregressive language models
generalize better to language variety but tend to memorize named entities and
often hallucinate. Due to their respective weaknesses, neither approach is yet
suitable for real-world deployment. We believe CheckDST is a useful guide for
future research to develop task-oriented dialogue models that embody the
strengths of various methods.
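For readers unfamiliar with the metric, the sketch below shows how joint goal accuracy is computed and how a CheckDST-style robustness drop could be derived from an augmented test set; the function names and data layout are illustrative assumptions, not the paper's released code.

```python
from typing import Callable, Dict, List, Tuple

# A dialogue state maps (domain, slot) pairs to values, e.g.
# {("restaurant", "food"): "italian", ("restaurant", "area"): "centre"}.
State = Dict[Tuple[str, str], str]

def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """JGA: the fraction of turns whose entire predicted state exactly
    matches the gold state."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def robustness_drop(model: Callable[[str], State],
                    original: List[Tuple[str, State]],
                    augmented: List[Tuple[str, State]]) -> float:
    """CheckDST-style probe: JGA on the original test set minus JGA on a
    perturbed copy (e.g. paraphrased utterances or swapped named entities).
    A large positive drop reveals brittleness that headline JGA hides."""
    jga_orig = joint_goal_accuracy([model(x) for x, _ in original],
                                   [y for _, y in original])
    jga_aug = joint_goal_accuracy([model(x) for x, _ in augmented],
                                  [y for _, y in augmented])
    return jga_orig - jga_aug
```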
Related papers
- Fast and Accurate Factual Inconsistency Detection Over Long Documents [19.86348214462828]
We introduce SCALE, a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy.
This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs.
We have publicly released our code and data on GitHub.
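As a rough sketch of the chunk-then-aggregate idea behind long-document inconsistency detection (the chunk size, overlap, and max-aggregation here are illustrative assumptions, not SCALE's exact procedure):

```python
from typing import Callable, List

def chunk_document(doc: str, chunk_size: int = 400, stride: int = 200) -> List[str]:
    """Split a long document into overlapping word-level chunks so each
    one fits within the scorer's input limit."""
    words = doc.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - stride, 1), stride)]

def inconsistency_score(doc: str, claim: str,
                        entails: Callable[[str, str], float]) -> float:
    """Score the claim against every chunk and keep the strongest
    entailment; a low maximum means no chunk supports the claim.
    `entails(premise, hypothesis)` is any NLI model returning P(entailment)."""
    support = max(entails(chunk, claim) for chunk in chunk_document(doc))
    return 1.0 - support  # higher = more likely factually inconsistent
```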
arXiv Detail & Related papers (2023-10-19T22:55:39Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose Auto-J, a generative judge with 13B parameters, designed to address the challenges of evaluating LLM alignment.
Our model is trained on user queries and LLM-generated responses drawn from a broad range of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Grounding Description-Driven Dialogue State Trackers with Knowledge-Seeking Turns [54.56871462068126]
Augmenting the training set with human or synthetic schema paraphrases improves model robustness to variations in schema wording but can be either costly or difficult to control.
We propose to circumvent these issues by grounding the state tracking model in knowledge-seeking turns collected from the dialogue corpus as well as the schema.
arXiv Detail & Related papers (2023-09-23T18:33:02Z)
- ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity? [2.3555053092246125]
We present preliminary experimental results on the ChatGPT research preview, showing that ChatGPT achieves state-of-the-art performance in zero-shot DST.
We theorize that the in-context learning capabilities of such models will likely become powerful tools to support the development of dedicated and dynamic dialogue state trackers.
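For intuition, here is a minimal sketch of what a zero-shot DST prompt for an instruction-tuned chat model could look like; the slot schema and output format are invented for illustration and are not the prompt used in the paper.

```python
SCHEMA = """restaurant: food, area, pricerange
hotel: name, stars, parking"""

def zero_shot_dst_prompt(dialogue_history: str) -> str:
    """Build a single prompt asking the model to emit the dialogue state
    as JSON, with no in-domain training or exemplars."""
    return (
        "Track the user's goals in the conversation below.\n"
        f"Slots you may fill:\n{SCHEMA}\n\n"
        f"Conversation:\n{dialogue_history}\n\n"
        'Return only JSON, e.g. {"restaurant-food": "italian"}.'
    )

# Usage with a hypothetical chat client:
# state_json = chat_model(zero_shot_dst_prompt(history))
```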
arXiv Detail & Related papers (2023-06-02T09:15:01Z)
- More Robust Schema-Guided Dialogue State Tracking via Tree-Based Paraphrase Ranking [0.0]
Fine-tuned language models excel at schema-guided dialogue state tracking (DST).
We propose a framework for generating synthetic schemas which uses tree-based ranking to jointly optimise diversity and semantic faithfulness.
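The joint objective can be pictured with a simple greedy re-ranker that trades off faithfulness to the original schema description against diversity from already-selected paraphrases; this is an illustrative approximation, not the paper's tree-based ranking.

```python
from typing import Callable, List

def select_paraphrases(candidates: List[str], original: str, k: int,
                       sim: Callable[[str, str], float],
                       alpha: float = 0.7) -> List[str]:
    """Greedily pick k schema paraphrases, balancing semantic faithfulness
    (similarity to the original description) against diversity
    (dissimilarity to paraphrases already chosen)."""
    chosen: List[str] = []
    pool = list(candidates)
    while pool and len(chosen) < k:
        def score(c: str) -> float:
            faithfulness = sim(original, c)
            diversity = 1.0 - max((sim(c, s) for s in chosen), default=0.0)
            return alpha * faithfulness + (1 - alpha) * diversity
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen
```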
arXiv Detail & Related papers (2023-03-17T11:43:08Z)
- Stabilized In-Context Learning with Pre-trained Language Models for Few-Shot Dialogue State Tracking [57.92608483099916]
Large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks.
For more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial.
We introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query.
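A minimal sketch of the general idea, assuming a stand-in saliency scorer: keep only the highest-scoring turns within a token budget so the saved space can hold additional in-context exemplars.

```python
from typing import Callable, List

def compress_dialogue(turns: List[str], budget: int,
                      saliency: Callable[[str], float]) -> List[str]:
    """Keep the highest-saliency turns whose combined length fits the
    budget, preserving the original turn order."""
    ranked = sorted(range(len(turns)),
                    key=lambda i: saliency(turns[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = len(turns[i].split())  # crude token count for illustration
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [turns[i] for i in sorted(kept)]

# A shorter compressed dialogue leaves room for more exemplars per query
# within the same prompt window.
```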
arXiv Detail & Related papers (2023-02-12T15:05:10Z)
- A Multi-Task BERT Model for Schema-Guided Dialogue State Tracking [78.2700757742992]
Task-oriented dialogue systems often employ a Dialogue State Tracker (DST) to successfully complete conversations.
Recent state-of-the-art DST implementations rely on schemata of diverse services to improve model robustness.
We propose a single multi-task BERT-based model that jointly solves the three DST tasks of intent prediction, requested slot prediction and slot filling.
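For intuition, a hedged sketch of a shared encoder with three task heads; the layer sizes and head designs are assumptions for illustration, not the paper's architecture.

```python
import torch.nn as nn

class MultiTaskDST(nn.Module):
    """A shared BERT-style encoder with three heads: intent prediction,
    requested-slot prediction, and span-based slot filling."""
    def __init__(self, encoder: nn.Module, hidden: int,
                 n_intents: int, n_slots: int):
        super().__init__()
        # `encoder` is assumed to return (token_states, pooled_vector),
        # as BertModel does with return_dict=False.
        self.encoder = encoder
        self.intent_head = nn.Linear(hidden, n_intents)
        self.requested_head = nn.Linear(hidden, n_slots)  # multi-label logits
        self.span_head = nn.Linear(hidden, 2)  # start/end logits per token

    def forward(self, input_ids, attention_mask):
        tokens, pooled = self.encoder(input_ids, attention_mask=attention_mask)
        return (self.intent_head(pooled),     # which intent is active
                self.requested_head(pooled),  # which slots are requested
                self.span_head(tokens))       # where slot values appear
```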
arXiv Detail & Related papers (2022-07-02T13:27:59Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Annotation Inconsistency and Entity Bias in MultiWOZ [40.127114829948965]
MultiWOZ is one of the most popular multi-domain task-oriented dialog datasets.
It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG), and end-to-end (E2E) dialog modeling.
arXiv Detail & Related papers (2021-05-29T00:09:06Z)
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)
- A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset [8.990035371365408]
We introduce FastSGT, a fast and robust BERT-based model for state tracking in goal-oriented dialogue systems.
The proposed model is designed for the Schema-Guided Dialogue dataset, which contains natural language descriptions of services, intents, and slots.
Our model remains efficient in computation and memory consumption while significantly improving accuracy.
arXiv Detail & Related papers (2020-08-27T18:51:18Z)