BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
- URL: http://arxiv.org/abs/2211.05610v1
- Date: Thu, 10 Nov 2022 14:37:23 GMT
- Title: BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
- Authors: Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Mohammad Taher
Pilehvar, Yadollah Yaghoobzadeh, Samira Ebrahimi Kahou
- Abstract summary: We introduce GraNd and its estimated version, EL2N, as scoring metrics for finding important examples in a dataset.
We show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve the test accuracy, but also surpass it.
- Score: 20.404705741136777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current pre-trained language models rely on large datasets for achieving
state-of-the-art performance. However, past research has shown that not all
examples in a dataset are equally important during training. In fact, it is
sometimes possible to prune a considerable fraction of the training set while
maintaining the test performance. Two gradient-based scoring metrics for
finding important examples, established on standard vision benchmarks, are
GraNd and its estimated version, EL2N. In this work, we employ these two
metrics in NLP for the first time. We demonstrate that these metrics need to
be computed after at least one epoch of fine-tuning and are not reliable in
the early steps of training.
Furthermore, we show that by pruning a small portion of the examples with the
highest GraNd/EL2N scores, we can not only preserve the test accuracy, but also
surpass it. This paper details adjustments and implementation choices which
enable GraNd and EL2N to be applied to NLP.
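As a concrete illustration, below is a minimal PyTorch sketch of the two
metrics, assuming a generic classifier that maps inputs to logits: EL2N is the
L2 norm of the error vector (softmax probabilities minus the one-hot label),
and the single-checkpoint GraNd used here is the norm of the per-example loss
gradient with respect to the parameters (the full definition averages over
random initializations). The toy linear model, the shapes, and the 10% pruning
fraction are illustrative placeholders, not the paper's BERT fine-tuning setup.

```python
import torch
import torch.nn.functional as F

def el2n_scores(model, inputs, labels, num_classes):
    # EL2N: L2 norm of the error vector (softmax probabilities minus the
    # one-hot label), one score per example.
    with torch.no_grad():
        probs = F.softmax(model(inputs), dim=-1)
        one_hot = F.one_hot(labels, num_classes).float()
        return (probs - one_hot).norm(dim=-1)

def grand_scores(model, inputs, labels):
    # GraNd: norm of the per-example gradient of the loss with respect to
    # all model parameters, computed one example at a time for clarity.
    scores = []
    for x, y in zip(inputs, labels):
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        sq_norm = sum((p.grad ** 2).sum() for p in model.parameters())
        scores.append(sq_norm.sqrt().item())
    return torch.tensor(scores)

# Toy usage: score 8 examples, then prune the highest-scoring 10%, mirroring
# the paper's recipe of pruning a small portion of the highest-scoring
# examples after at least one epoch of fine-tuning.
model = torch.nn.Linear(16, 4)  # stand-in for a fine-tuned encoder + head
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
scores = el2n_scores(model, x, y, num_classes=4)
print(grand_scores(model, x, y))  # GraNd scores for the same batch
keep = scores.argsort()[: int(0.9 * len(scores))]  # drop the top 10%
x_pruned, y_pruned = x[keep], y[keep]
```

Per the abstract, these scores are only meaningful when computed from a
checkpoint after at least one epoch of fine-tuning; in early steps they are
unreliable.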
Related papers
- Enhancing Understanding Through Wildlife Re-Identification [0.0]
We analyze the performance of multiple models on multiple datasets.
We find that taking models trained for classification, removing the output layer, and using the second-to-last layer as an embedding was not a successful strategy for learning.
The DCNNs performed well on some datasets but poorly on others, which did not align with findings in previous literature.
The LightGBM overfitted too heavily and was not significantly better than a constant model when trained and evaluated on all pairs using accuracy as a metric.
arXiv Detail & Related papers (2024-05-17T22:28:50Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
- Augmenting NLP data to counter Annotation Artifacts for NLI Tasks [0.0]
Large pre-trained NLP models achieve high performance on benchmark datasets but do not actually "solve" the underlying task.
We explore this phenomenon by first using contrast and adversarial examples to understand the limitations of the model's performance.
We then propose a data augmentation technique to fix this bias and measure its effectiveness.
arXiv Detail & Related papers (2023-02-09T15:34:53Z)
- Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z)
- Fortunately, Discourse Markers Can Enhance Language Models for Sentiment Analysis [13.149482582098429]
We propose to leverage sentiment-carrying discourse markers to generate large-scale weakly-labeled data.
We show the value of our approach on various benchmark datasets, including the finance domain.
arXiv Detail & Related papers (2022-01-06T12:33:47Z)
- Deep Learning on a Data Diet: Finding Important Examples Early in Training [35.746302913918484]
In vision datasets, simple scores can be used to identify important examples very early in training.
We propose two such scores -- the Gradient Normed (GraNd) and the Error L2-Norm (EL2N).
arXiv Detail & Related papers (2021-07-15T02:12:20Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train inference models on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics [70.45937234489044]
We re-organize two widely-used TSGV datasets (Charades-STA and ActivityNet Captions) so that the test split is different from the training split.
We introduce a new evaluation metric "dR@$n$,IoU@$m$" to calibrate the basic IoU scores.
All the results demonstrate that the re-organized datasets and new metric can better monitor the progress in TSGV.
arXiv Detail & Related papers (2021-01-22T09:59:30Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the created dataset tractable, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)