RADDLE: An Evaluation Benchmark and Analysis Platform for Robust
Task-oriented Dialog Systems
- URL: http://arxiv.org/abs/2012.14666v1
- Date: Tue, 29 Dec 2020 08:58:49 GMT
- Title: RADDLE: An Evaluation Benchmark and Analysis Platform for Robust
Task-oriented Dialog Systems
- Authors: Baolin Peng, Chunyuan Li, Zhu Zhang, Chenguang Zhu, Jinchao Li,
Jianfeng Gao
- Abstract summary: We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For task-oriented dialog systems to be maximally useful, they must be able to
process conversations in a way that is (1) generalizable with a small number of
training examples for new task domains, and (2) robust to user input in various
styles, modalities or domains. In pursuit of these goals, we introduce the
RADDLE benchmark, a collection of corpora and tools for evaluating the
performance of models across a diverse set of domains. By including tasks with
limited training data, RADDLE is designed to favor and encourage models with a
strong generalization ability. RADDLE also includes a diagnostic checklist that
facilitates detailed robustness analysis in aspects such as language
variations, speech errors, unseen entities, and out-of-domain utterances. We
evaluate recent state-of-the-art systems based on pre-training and fine-tuning,
and find that grounded pre-training on heterogeneous dialog corpora performs
better than training a separate model per domain. Overall, existing models are
less than satisfactory in robustness evaluation, which suggests opportunities
for future improvement.
Related papers
- R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems.
Existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge.
In this paper, we address the challenges of evaluating retrieval-augmented large language models (RALLMs) by introducing R-Eval, a Python toolkit designed to streamline the evaluation of different RAGs.
arXiv Detail & Related papers (2024-06-17T15:59:49Z)
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
- Zero-Shot Generalizable End-to-End Task-Oriented Dialog System using Context Summarization and Domain Schema [2.7178968279054936]
State-of-the-art approaches in task-oriented dialog systems formulate the problem as a conditional sequence generation task.
This requires labeled training data for each new domain or task.
We introduce a novel Zero-Shot generalizable end-to-end Task-oriented Dialog system, ZS-ToD.
arXiv Detail & Related papers (2023-03-28T18:56:31Z)
- Stabilized In-Context Learning with Pre-trained Language Models for Few Shot Dialogue State Tracking [57.92608483099916]
Large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks.
For more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial.
We introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query.
arXiv Detail & Related papers (2023-02-12T15:05:10Z)
- DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization [127.714919036388]
DIONYSUS is a pre-trained encoder-decoder model for summarizing dialogues in any new domain.
Our experiments show that DIONYSUS outperforms existing methods on six datasets.
arXiv Detail & Related papers (2022-12-20T06:21:21Z)
- DiSTRICT: Dialogue State Tracking with Retriever Driven In-Context Tuning [7.5700317050237365]
We propose DiSTRICT, a generalizable in-context tuning approach for Dialogue State Tracking (DST).
DiSTRICT retrieves highly relevant training examples for a given dialogue to fine-tune the model without any hand-crafted templates.
Experiments with the MultiWOZ benchmark datasets show that DiSTRICT outperforms existing approaches in various zero-shot and few-shot settings.
arXiv Detail & Related papers (2022-12-06T09:40:15Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- Representation Learning for Conversational Data using Discourse Mutual Information Maximization [9.017156603976915]
We argue that structure-unaware word-by-word generation is not suitable for effective conversation modeling.
We propose DMI, a structure-aware, mutual-information-based loss function for training dialog-representation models.
Our models show the most promising performance on the dialog evaluation task DailyDialog++, in both random and adversarial negative scenarios.
arXiv Detail & Related papers (2021-12-04T13:17:07Z)
- Self-training Improves Pre-training for Few-shot Learning in Task-oriented Dialog Systems [47.937191088981436]
Large-scale pre-trained language models have shown promising results for few-shot learning in ToD.
We propose a self-training approach that iteratively labels the most confident unlabeled data to train a stronger Student model.
We conduct experiments and present analyses on four downstream tasks in ToD, including intent classification, dialog state tracking, dialog act prediction, and response selection.
arXiv Detail & Related papers (2021-08-28T07:22:06Z)