An Evaluation Dataset and Strategy for Building Robust Multi-turn
Response Selection Model
- URL: http://arxiv.org/abs/2109.04834v1
- Date: Fri, 10 Sep 2021 12:36:13 GMT
- Title: An Evaluation Dataset and Strategy for Building Robust Multi-turn
Response Selection Model
- Authors: Kijong Han, Seojin Lee, Wooin Lee, Joosung Lee, Dong-hun Lee
- Abstract summary: Multi-turn response selection models have recently shown performance comparable to humans on several benchmark datasets.
In real-world environments, these models often exhibit weaknesses, such as making incorrect predictions by relying heavily on superficial patterns.
- Score: 3.20238141000059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-turn response selection models have recently shown comparable
performance to humans on several benchmark datasets. However, in real-world
environments, these models often exhibit weaknesses, such as making incorrect
predictions by relying heavily on superficial patterns rather than a
comprehensive understanding of the context. For example, these models often
assign a high score to a wrong response candidate that contains several
keywords related to the context but uses an inconsistent tense. In this
study, we analyze the weaknesses of open-domain Korean multi-turn response
selection models and
publish an adversarial dataset to evaluate these weaknesses. We also suggest a
strategy to build a robust model in this adversarial environment.
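As a toy illustration of the weakness described in the abstract, the sketch below scores candidates by keyword overlap alone and builds a tense-inconsistent distractor. The scorer and the English perturbation are stand-ins I made up for illustration; they are not the paper's Korean models or dataset.

```python
# Hypothetical probe for the tense-inconsistency weakness described in the
# abstract. `score_response` is a stand-in for any multi-turn response
# selection model mapping (context, candidate) to a relevance score.

def score_response(context: list[str], candidate: str) -> float:
    """Stub scorer: keyword overlap only, which is exactly the kind of
    superficial pattern the paper argues real models over-rely on."""
    context_words = {w for turn in context for w in turn.lower().split()}
    candidate_words = set(candidate.lower().split())
    return len(context_words & candidate_words) / max(len(candidate_words), 1)

def make_tense_adversary(response: str) -> str:
    """Build a distractor that keeps the keywords but breaks the tense
    (a toy English stand-in for the paper's Korean perturbations)."""
    swaps = {"went": "will go", "saw": "will see", "was": "will be"}
    return " ".join(swaps.get(w, w) for w in response.split())

context = ["where did you go yesterday", "i went to the beach with mina"]
gold = "i saw mina at the beach yesterday"
adversary = make_tense_adversary(gold)

# A purely superficial scorer ranks the tense-inconsistent distractor
# nearly as high as the gold response.
print(score_response(context, gold), score_response(context, adversary))
```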
Related papers
- A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios [5.617202699068449]
We evaluate the robustness of several large language models on multiple datasets.
Benchmark datasets are constructed by introducing naturally occurring, non-malicious perturbations.
arXiv Detail & Related papers (2024-08-04T08:43:09Z)
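A minimal sketch of a non-adversarial robustness score in the spirit of the entry above: compare accuracy on clean inputs with accuracy on benign, meaning-preserving variants. The perturbation and model below are toy placeholders, not the paper's actual metric.

```python
import random

def keyboard_typo(text: str, rng: random.Random) -> str:
    """Benign perturbation: drop one character, standing in for
    naturally occurring noise such as typos."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def robustness_score(model, dataset, rng=random.Random(0)) -> float:
    """Ratio of perturbed accuracy to clean accuracy (1.0 = fully robust)."""
    clean = sum(model(x) == y for x, y in dataset)
    perturbed = sum(model(keyboard_typo(x, rng)) == y for x, y in dataset)
    return perturbed / clean if clean else 0.0

# Toy model: predicts 1 only if the word "refund" survives the perturbation.
model = lambda text: int("refund" in text)
dataset = [("i want a refund now", 1), ("great product", 0)]
print(robustness_score(model, dataset))
```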
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
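The abstention detector in the entry above is learned; as a hedged sketch of the underlying idea, a simple confidence threshold can stand in for it: predict when the model's best class probability is high enough, otherwise abstain.

```python
def predict_or_abstain(probs: list[float], threshold: float = 0.75):
    """Return the argmax class, or None (abstain) when the model's best
    probability suggests the context does not support any answer."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

print(predict_or_abstain([0.92, 0.05, 0.03]))  # confident -> class 0
print(predict_or_abstain([0.40, 0.35, 0.25]))  # ambiguous -> None (abstain)
```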
- Deep Neural Network Benchmarks for Selective Classification [27.098996474946446]
Multiple selective classification frameworks exist, most of which rely on deep neural network architectures.
We evaluate these approaches using several criteria, including selective error rate, empirical coverage, the class distribution of rejected instances, and performance on out-of-distribution instances.
arXiv Detail & Related papers (2024-01-23T12:15:47Z)
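To make two of the criteria named in the entry above concrete, here is an illustrative computation of empirical coverage (fraction of inputs the selective classifier accepts) and selective error rate (errors among accepted inputs only). The predictions, confidences, and 0.8 threshold are made-up example values.

```python
preds  = [1, 0, 1, 1, 0, 1]                 # model predictions
labels = [1, 0, 0, 1, 1, 1]                 # ground truth
confs  = [0.9, 0.95, 0.82, 0.85, 0.5, 0.99] # model confidence per input
tau = 0.8                                   # acceptance threshold

accepted = [(p, y) for p, y, c in zip(preds, labels, confs) if c >= tau]
coverage = len(accepted) / len(preds)
selective_error = sum(p != y for p, y in accepted) / len(accepted)

print(f"coverage={coverage:.2f}, selective error={selective_error:.2f}")
```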
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
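A minimal mixup-style sketch of the "mix minority and majority samples" idea from the entry above; the actual framework is iterative and more involved. The feature vectors here are toy 2-D points.

```python
import random

def mix_samples(minority, majority, lam_low=0.6, rng=random.Random(0)):
    """Create one synthetic minority sample as a convex combination of a
    minority and a majority point, biased toward the minority side."""
    x_min = rng.choice(minority)
    x_maj = rng.choice(majority)
    lam = rng.uniform(lam_low, 1.0)   # weight on the minority sample
    return [lam * a + (1 - lam) * b for a, b in zip(x_min, x_maj)]

minority = [[0.0, 1.0], [0.2, 0.9]]
majority = [[5.0, 5.0], [4.8, 5.3], [5.1, 4.7]]
synthetic = [mix_samples(minority, majority) for _ in range(3)]
print(synthetic)
```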
- Next-Year Bankruptcy Prediction from Textual Data: Benchmark and Baselines [10.944533132358439]
Models for bankruptcy prediction are useful in several real-world scenarios.
The lack of a common benchmark dataset and evaluation strategy impedes the objective comparison between models.
This paper introduces such a benchmark for the unstructured data scenario, based on novel and established datasets.
arXiv Detail & Related papers (2022-08-24T07:11:49Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
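As a minimal sketch of the oversampling strategy mentioned in the entry above: duplicate minority examples until classes are balanced. A real study would also compare undersampling, SMOTE variants, and so on.

```python
import random
from collections import Counter

def random_oversample(X, y, rng=random.Random(0)):
    """Duplicate randomly chosen examples of each minority class until
    every class matches the majority class size."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):          # top up to the majority size
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 0, 1]                       # heavily imbalanced
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))                        # Counter({0: 5, 1: 5})
```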
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
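A toy greedy word-substitution attack in the spirit of the AdvGLUE entry above; the benchmark's 14 attack methods are far more sophisticated. The classifier stub and synonym table are invented for illustration.

```python
SYNONYMS = {"good": "fine", "great": "okay", "love": "like"}

def classify(text: str) -> int:
    """Stub classifier keyed on a few positive words (1 = positive)."""
    return int(any(w in text.split() for w in ("good", "great", "love")))

def greedy_attack(text: str) -> str:
    """Swap one word at a time, returning the first swap that flips
    the predicted label."""
    original = classify(text)
    words = text.split()
    for i, w in enumerate(words):
        if w in SYNONYMS:
            candidate = " ".join(words[:i] + [SYNONYMS[w]] + words[i + 1:])
            if classify(candidate) != original:
                return candidate
    return text  # no successful single-word attack found

print(greedy_attack("this movie is great"))  # -> "this movie is okay"
```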
- Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation [34.52276336319678]
Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks.
Over-reliance on content similarity makes the models less sensitive to the presence of inconsistencies.
We propose approaches for automatically creating adversarial negative training data.
arXiv Detail & Related papers (2021-06-10T16:20:55Z)
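A hedged sketch of one way to synthesize adversarial negatives for response ranking, echoing the entry above: start from an unrelated utterance and inject salient context keywords, so content overlap is high while coherence is not. The paper's actual generation approaches differ.

```python
import random

def synthesize_negative(context_turns, random_utterance, k=2,
                        rng=random.Random(0)):
    """Insert k salient context words into an unrelated utterance."""
    keywords = [w for turn in context_turns
                for w in turn.split() if len(w) > 4]
    picked = rng.sample(keywords, min(k, len(keywords)))
    return random_utterance + " " + " ".join(picked)

context = ["did you book the flight to jeju island", "yes for friday morning"]
negative = synthesize_negative(context, "my cat hates the vacuum")
print(negative)  # unrelated reply, now peppered with context keywords
```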
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
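A toy genetic algorithm for the ensemble-selection idea in the last entry: evolve a bitmask over candidate source models so that attacks built on the chosen ensemble fool as many victim models as possible. The fooling matrix below is fabricated; in the paper, fitness would come from actually attacking the ensemble and measuring transfer.

```python
import random

rng = random.Random(0)
# FOOLS[i][j] = 1 if attacks built on source model i transfer to victim j.
FOOLS = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1]]

def fitness(mask):
    """Count victims fooled by at least one selected source model."""
    return sum(any(FOOLS[i][j] for i, sel in enumerate(mask) if sel)
               for j in range(len(FOOLS[0])))

def evolve(pop_size=8, generations=20, mutate_p=0.2):
    pop = [[rng.randint(0, 1) for _ in FOOLS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # keep the fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (rng.random() < mutate_p) for g in child]  # mutate
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```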