Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset
Evaluation for Text Classification
- URL: http://arxiv.org/abs/2205.02129v1
- Date: Wed, 4 May 2022 15:33:00 GMT
- Title: Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset
Evaluation for Text Classification
- Authors: Yang Xiao, Jinlan Fu, See-Kiong Ng, Pengfei Liu
- Score: 39.01740345482624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we ask the research question of whether all the datasets in
the benchmark are necessary. We approach this by first characterizing the
distinguishability of datasets when comparing different systems. Experiments on
9 datasets and 36 systems show that several existing benchmark datasets
contribute little to discriminating top-scoring systems, while those less used
datasets exhibit impressive discriminative power. Further, taking the text
classification task as a case study, we investigate whether a dataset's
discriminative power can be predicted from its properties (e.g., average sentence length).
Our preliminary experiments promisingly show that given a sufficient number of
training experimental records, a meaningful predictor can be learned to
estimate dataset discrimination over unseen datasets. We released all datasets
with features explored in this work on DataLab:
\url{https://datalab.nlpedia.ai}.
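The "distinguishability" idea in the abstract can be sketched with a simple proxy score. The function below is a hypothetical illustration, not the paper's actual measure: it treats a dataset as discriminative when its top systems' scores are well separated rather than tied.

```python
# Hypothetical sketch of "dataset discrimination" (an illustrative proxy,
# not the authors' actual measure): how well a dataset's results separate
# the top-scoring systems.
import statistics

def discrimination_score(system_scores, top_k=5):
    """Spread (population std dev) of the top-k system scores.
    If the top systems effectively tie, the dataset barely helps rank them."""
    top = sorted(system_scores, reverse=True)[:top_k]
    return statistics.pstdev(top)

# Toy usage: six systems evaluated on two datasets.
flat = [0.950, 0.949, 0.948, 0.948, 0.947, 0.900]    # top systems tie
spread = [0.950, 0.910, 0.880, 0.850, 0.800, 0.700]  # top systems separate
assert discrimination_score(spread) > discrimination_score(flat)
```

Given such scores for many datasets, one could fit a regressor from dataset properties (e.g., average sentence length) to the score, which is the prediction setting the abstract describes.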
Related papers
- A Suite of Fairness Datasets for Tabular Classification [2.0813318162800707]
We introduce a suite of functions for fetching 20 fairness datasets and providing associated fairness metadata.
Hopefully, these will lead to more rigorous experimental evaluations in future fairness-aware machine learning research.
arXiv Detail & Related papers (2023-07-31T19:58:12Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- A Bag-of-Prototypes Representation for Dataset-Level Applications [24.629132557336312]
This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty.
We propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to dataset-level bag consisting of semantic prototypes.
BoP consistently shows its advantage over existing representations on a series of benchmarks for two dataset-level tasks.
arXiv Detail & Related papers (2023-03-23T13:33:58Z)
- Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings [20.37092575427039]
We tackle the common problem of missing benchmark datasets for anomaly detection.
The Vehicle Claims dataset consists of fraudulent insurance claims for automotive repairs.
The dataset is evaluated with both shallow and deep learning methods.
arXiv Detail & Related papers (2022-10-25T14:33:17Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-art.
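The classical margin-sampling baseline named in this entry is easy to state; the sketch below is a minimal illustration (the array of probabilities and query size are toy assumptions, not from the study):

```python
# Minimal sketch of classical margin sampling for active learning:
# query the unlabeled points whose top-2 class probabilities are closest.
import numpy as np

def margin_sample(probs, n_query):
    """Return indices of the n_query rows with the smallest margin
    between the highest and second-highest predicted probabilities."""
    part = np.sort(probs, axis=1)          # ascending within each row
    margins = part[:, -1] - part[:, -2]    # top1 - top2 per example
    return np.argsort(margins)[:n_query]   # smallest margins first

# Toy usage: three unlabeled examples, three classes.
probs = np.array([[0.50, 0.49, 0.01],   # ambiguous -> queried
                  [0.98, 0.01, 0.01],   # confident -> skipped
                  [0.40, 0.35, 0.25]])  # ambiguous -> queried
assert set(margin_sample(probs, 2).tolist()) == {0, 2}
```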
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
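The embedding-based alignment this entry describes can be illustrated with a toy sketch; the vectors and labels below are invented for illustration (in practice the embeddings come from a pretrained language model, and this is not Detection Hub's actual code):

```python
# Illustrative sketch: aligning category labels from different datasets in
# a shared word-embedding space instead of disjoint one-hot vocabularies.
import numpy as np

# Toy "pretrained" word vectors (hypothetical values for illustration).
emb = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.88, 0.12, 0.02]),
    "person":     np.array([0.00, 0.20, 0.90]),
}

def align(label_a, label_b):
    """Cosine similarity of category names; near-synonyms from different
    datasets land close together, unlike unrelated one-hot labels."""
    a, b = emb[label_a], emb[label_b]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" (dataset A) matches "automobile" (dataset B), not "person".
assert align("car", "automobile") > align("car", "person")
```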
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimal annotation cost.
Our proposed method has no parameters to tune, making it dataset-independent.
Our method achieves higher performance than state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the resulting dataset can significantly improve the ability of the learned FER model.
To keep this enlarged dataset manageable, we apply a dataset distillation strategy to compress it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed details) and is not responsible for any consequences of its use.