Identifying Untrustworthy Samples: Data Filtering for Open-domain
Dialogues with Bayesian Optimization
- URL: http://arxiv.org/abs/2109.06471v1
- Date: Tue, 14 Sep 2021 06:42:54 GMT
- Title: Identifying Untrustworthy Samples: Data Filtering for Open-domain
Dialogues with Bayesian Optimization
- Authors: Lei Shen, Haolan Zhan, Xin Shen, Hongshen Chen, Xiaofang Zhao and
Xiaodan Zhu
- Abstract summary: We present a data filtering method for open-domain dialogues.
We score training samples with a quality measure, sort them in descending order, and filter out those at the bottom.
Experimental results on two datasets show that our method can effectively identify untrustworthy samples.
- Score: 28.22184410167622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Being able to reply with a related, fluent, and informative response is an
indispensable requirement for building high-quality conversational agents. In
order to generate better responses, some approaches have been proposed, such as
feeding extra information by collecting large-scale datasets with human
annotations, designing neural conversational models (NCMs) with complex
architecture and loss functions, or filtering out untrustworthy samples based
on a dialogue attribute, e.g., Relatedness or Genericness. In this paper, we
follow the third research branch and present a data filtering method for
open-domain dialogues, which identifies untrustworthy samples from training
data with a quality measure that linearly combines seven dialogue attributes.
The attribute weights are obtained via Bayesian Optimization (BayesOpt), which
iteratively optimizes an objective function for dialogue generation on the
validation set. Then we score training samples with the quality measure,
sort them in descending order, and filter out those at the bottom. Furthermore,
to accelerate the "filter-train-evaluate" iterations involved in BayesOpt on
large-scale datasets, we propose a training framework that integrates maximum
likelihood estimation (MLE) and a negative training method (NEG). This framework
updates the parameters of a trained NCM on two small sets containing the newly
maintained and newly removed samples, respectively. Specifically, MLE is applied to
maximize the log-likelihood of newly maintained samples, while NEG is used to
minimize the log-likelihood of newly removed ones. Experimental results on two
datasets show that our method can effectively identify untrustworthy samples,
and NCMs trained on the filtered datasets achieve better performance.
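To make the pipeline concrete, the sketch below illustrates the two core steps described above: scoring samples with the attribute-weighted quality measure and refreshing a trained model with MLE + NEG. Everything here is an illustrative assumption rather than the authors' released code: the attribute names, the drop ratio, the synthetic scores, and the stand-in decoder are hypothetical, and the BayesOpt search over the seven weights is only indicated in comments.

```python
# Minimal sketch of the quality-measure filtering step (assumptions noted above).
import numpy as np

# Seven dialogue attributes; these names are assumed for illustration.
ATTRIBUTES = ["relatedness", "genericness", "fluency", "informativeness",
              "coherence", "specificity", "repetitiveness"]

def quality_scores(attr_matrix: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Quality measure: a linear combination of the seven attribute scores.

    attr_matrix: (n_samples, 7) per-sample attribute scores.
    weights:     (7,) attribute weights, which BayesOpt would tune.
    """
    return attr_matrix @ weights

def filter_bottom(attr_matrix: np.ndarray, weights: np.ndarray,
                  drop_ratio: float = 0.1) -> np.ndarray:
    """Score samples, sort them in descending order, drop the bottom fraction."""
    order = np.argsort(-quality_scores(attr_matrix, weights))  # descending
    n_keep = int(len(order) * (1.0 - drop_ratio))
    return order[:n_keep]  # indices of the samples kept for training

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attrs = rng.random((1000, len(ATTRIBUTES)))  # stand-in attribute scores
    w = rng.random(len(ATTRIBUTES))              # a weight vector BayesOpt might propose
    kept = filter_bottom(attrs, w, drop_ratio=0.1)
    print(f"kept {len(kept)} of {len(attrs)} samples")
    # In the full method, each proposed weight vector triggers a
    # filter-train-evaluate round, and BayesOpt updates the weights
    # from the resulting validation-set objective.
```

Each BayesOpt proposal would ordinarily require retraining from scratch; the accelerated framework instead updates an already-trained NCM only on the samples whose filter decision just changed. A toy version of that MLE + NEG loss, with a random linear decoder standing in for a real conversational model, might look like:

```python
# Hedged sketch of the MLE + NEG refresh on the two small "delta" sets.
# The linear decoder and random hidden states are placeholders for a trained
# NCM; only the structure of the loss is the point of this example.
import torch
import torch.nn.functional as F

def mle_neg_loss(logits_keep, targets_keep, logits_drop, targets_drop):
    """MLE raises the log-likelihood of newly maintained samples, while NEG
    lowers the log-likelihood of newly removed ones."""
    nll_keep = F.cross_entropy(logits_keep.flatten(0, 1), targets_keep.flatten())
    nll_drop = F.cross_entropy(logits_drop.flatten(0, 1), targets_drop.flatten())
    return nll_keep - nll_drop  # descending this gradient performs MLE + NEG

if __name__ == "__main__":
    vocab, hidden, batch, seq_len = 100, 16, 4, 8
    decoder = torch.nn.Linear(hidden, vocab)       # stand-in for the NCM
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
    h_keep = torch.randn(batch, seq_len, hidden)   # newly maintained batch
    h_drop = torch.randn(batch, seq_len, hidden)   # newly removed batch
    y_keep = torch.randint(0, vocab, (batch, seq_len))
    y_drop = torch.randint(0, vocab, (batch, seq_len))
    loss = mle_neg_loss(decoder(h_keep), y_keep, decoder(h_drop), y_drop)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because only the deltas between consecutive filtered sets are touched, each BayesOpt iteration avoids a full training pass over the large-scale dataset.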
Related papers
- RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- A Data Cartography based MixUp for Pre-trained Language Models [47.90235939359225]
MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels.
We propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples.
We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model, BERT, on both in-domain and out-of-domain settings in a wide range of NLP tasks.
arXiv Detail & Related papers (2022-05-06T17:59:19Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Learning from Mistakes: Combining Ontologies via Self-Training for Dialogue Generation [6.221019624345408]
Natural language generators (NLGs) for task-oriented dialogue typically take a meaning representation (MR) as input.
We create a new, larger combined ontology, and then train an NLG to produce utterances covering it.
For example, if one dataset has attributes for family-friendly and rating information, and the other has attributes for decor and service, our aim is an NLG for the combined ontology that can produce utterances that realize values for family-friendly, rating, decor and service.
arXiv Detail & Related papers (2020-09-30T23:54:38Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.