BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
- URL: http://arxiv.org/abs/2211.05610v1
- Date: Thu, 10 Nov 2022 14:37:23 GMT
- Title: BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
- Authors: Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Mohammad Taher
Pilehvar, Yadollah Yaghoobzadeh, Samira Ebrahimi Kahou
- Abstract summary: We introduce GraNd and its estimated version, EL2N, as scoring metrics for finding important examples in a dataset.
We show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve the test accuracy, but also surpass it.
- Score: 20.404705741136777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current pre-trained language models rely on large datasets for achieving
state-of-the-art performance. However, past research has shown that not all
examples in a dataset are equally important during training. In fact, it is
sometimes possible to prune a considerable fraction of the training set while
maintaining the test performance. Two gradient-based scoring metrics for
finding important examples, established on standard vision benchmarks, are
GraNd and its estimated version, EL2N. In this work, we employ these two
metrics in NLP for the first time. We demonstrate that these metrics need to
be computed after at least one epoch of fine-tuning and are not reliable in
the early steps of training.
Furthermore, we show that by pruning a small portion of the examples with the
highest GraNd/EL2N scores, we can not only preserve the test accuracy, but also
surpass it. This paper details adjustments and implementation choices which
enable GraNd and EL2N to be applied to NLP.
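As a concrete illustration, below is a minimal PyTorch sketch of the two
metrics, assuming a generic classifier that maps inputs to logits: EL2N is the
L2 norm of the error vector (softmax probabilities minus the one-hot label),
and the single-checkpoint GraNd used here is the norm of the per-example loss
gradient with respect to the parameters (the full definition averages over
random initializations). The toy linear model, the shapes, and the 10% pruning
fraction are illustrative placeholders, not the paper's BERT fine-tuning setup.

```python
import torch
import torch.nn.functional as F

def el2n_scores(model, inputs, labels, num_classes):
    # EL2N: L2 norm of the error vector (softmax probabilities minus the
    # one-hot label), one score per example.
    with torch.no_grad():
        probs = F.softmax(model(inputs), dim=-1)
        one_hot = F.one_hot(labels, num_classes).float()
        return (probs - one_hot).norm(dim=-1)

def grand_scores(model, inputs, labels):
    # GraNd: norm of the per-example gradient of the loss with respect to
    # all model parameters, computed one example at a time for clarity.
    scores = []
    for x, y in zip(inputs, labels):
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        sq_norm = sum((p.grad ** 2).sum() for p in model.parameters())
        scores.append(sq_norm.sqrt().item())
    return torch.tensor(scores)

# Toy usage: score 8 examples, then prune the highest-scoring 10%, mirroring
# the paper's recipe of pruning a small portion of the highest-scoring
# examples after at least one epoch of fine-tuning.
model = torch.nn.Linear(16, 4)  # stand-in for a fine-tuned encoder + head
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
scores = el2n_scores(model, x, y, num_classes=4)
print(grand_scores(model, x, y))  # GraNd scores for the same batch
keep = scores.argsort()[: int(0.9 * len(scores))]  # drop the top 10%
x_pruned, y_pruned = x[keep], y[keep]
```

Per the abstract, these scores are only meaningful when computed from a
checkpoint after at least one epoch of fine-tuning; in early steps they are
unreliable.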
Related papers
- Enhancing Understanding Through Wildlife Re-Identification [0.0]
We analyze the performance of multiple models on multiple datasets.
We find that taking models trained for classification, removing the output layer, and using the second-to-last layer as an embedding was not a successful strategy for learning.
The DCNNs performed well on some datasets but poorly on others, which did not align with findings in previous literature.
The LightGBM overfitted too heavily and was not significantly better than a constant model when trained and evaluated on all pairs using accuracy as a metric.
arXiv Detail & Related papers (2024-05-17T22:28:50Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
- Augmenting NLP data to counter Annotation Artifacts for NLI Tasks [0.0]
Large pre-trained NLP models achieve high performance on benchmark datasets but do not actually "solve" the underlying task.
We explore this phenomenon by first using contrast and adversarial examples to understand the limitations of the model's performance.
We then propose a data augmentation technique to fix this bias and measure its effectiveness.
arXiv Detail & Related papers (2023-02-09T15:34:53Z)
- Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z)
- Fortunately, Discourse Markers Can Enhance Language Models for Sentiment Analysis [13.149482582098429]
We propose to leverage sentiment-carrying discourse markers to generate large-scale weakly-labeled data.
We show the value of our approach on various benchmark datasets, including the finance domain.
arXiv Detail & Related papers (2022-01-06T12:33:47Z)
- Deep Learning on a Data Diet: Finding Important Examples Early in Training [35.746302913918484]
In vision datasets, simple scores can be used to identify important examples very early in training.
We propose two such scores -- the Gradient Normed (GraNd) and the Error L2-Norm (EL2N).
arXiv Detail & Related papers (2021-07-15T02:12:20Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train inference models on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics [70.45937234489044]
We re-organize two widely-used TSGV datasets (Charades-STA and ActivityNet Captions) so that the test split is different from the training split.
We introduce a new evaluation metric "dR@$n$,IoU@$m$" to calibrate the basic IoU scores.
All the results demonstrate that the re-organized datasets and new metric can better monitor the progress in TSGV.
arXiv Detail & Related papers (2021-01-22T09:59:30Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the created dataset tractable, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)