Cost-Sensitive BERT for Generalisable Sentence Classification with
Imbalanced Data
- URL: http://arxiv.org/abs/2003.11563v1
- Date: Mon, 16 Mar 2020 19:10:57 GMT
- Authors: Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle
- Abstract summary: We show that BERT does not generalise well when the training and test data are sufficiently dissimilar.
We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT.
We achieve the second-highest score on sentence-level propaganda classification.
- Score: 5.08128537391027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automatic identification of propaganda has gained significance in recent
years due to technological and social changes in the way news is generated and
consumed. That this task can be addressed effectively using BERT, a powerful
new architecture which can be fine-tuned for text classification tasks, is not
surprising. However, propaganda detection, like other tasks that deal with news
documents and other forms of decontextualized social communication (e.g.
sentiment analysis), inherently deals with data whose categories are
simultaneously imbalanced and dissimilar. We show that BERT, while capable of
handling imbalanced classes with no additional data augmentation, does not
generalise well when the training and test data are sufficiently dissimilar (as
is often the case with news sources, whose topics evolve over time). We show
how to address this problem by providing a statistical measure of similarity
between datasets and a method of incorporating cost-weighting into BERT when
the training and test sets are dissimilar. We test these methods on the
Propaganda Techniques Corpus (PTC) and achieve the second-highest score on
sentence-level propaganda classification.
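The abstract names two ingredients: a statistical measure of similarity between training and test datasets, and cost-weighting in the classification loss. The abstract does not specify which statistic or weighting scheme the paper uses, so the following is only an illustrative sketch: Jensen-Shannon divergence between unigram word distributions as a dataset-similarity proxy, and inverse-frequency class weights applied to cross-entropy. All function names here are hypothetical.

```python
import math
from collections import Counter

def js_divergence(corpus_a, corpus_b):
    """Jensen-Shannon divergence between the unigram word distributions
    of two corpora (0 = identical, log 2 = fully disjoint vocabularies)."""
    ca, cb = Counter(), Counter()
    for sent in corpus_a:
        ca.update(sent.lower().split())
    for sent in corpus_b:
        cb.update(sent.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    div = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2
        if p:
            div += 0.5 * p * math.log(p / m)
        if q:
            div += 0.5 * q * math.log(q / m)
    return div

def class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, label, weights):
    """Cost-weighted cross-entropy for one example; `probs` maps each
    class to the model's softmax probability for it."""
    return -weights[label] * math.log(probs[label])
```

In practice the per-class weights would be passed to the loss of a fine-tuned BERT classifier (e.g. a weighted cross-entropy over the logits), so that errors on the minority class cost more during training.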
Related papers
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using
Genre Classification [0.27195102129095]
We show that classification tasks still suffer from a performance gap when the underlying distribution of topics changes.
We quantify this phenomenon empirically with a large corpus and a large set of topics.
We suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50% for some topics.
arXiv Detail & Related papers (2023-11-27T18:53:31Z)
- JointMatch: A Unified Approach for Diverse and Collaborative
Pseudo-Labeling to Semi-Supervised Text Classification [65.268245109828]
Semi-supervised text classification (SSTC) has gained increasing attention due to its ability to leverage unlabeled data.
Existing approaches based on pseudo-labeling suffer from the issues of pseudo-label bias and error accumulation.
We propose JointMatch, a holistic approach for SSTC that addresses these challenges by unifying ideas from recent semi-supervised learning.
arXiv Detail & Related papers (2023-10-23T05:43:35Z)
- Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News
Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets a new state of the art for few-shot fake news detection by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z)
- Noisy Self-Training with Data Augmentations for Offensive and Hate
Speech Detection Tasks [3.703767478524629]
"Noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against adversarial attacks.
We run experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, yielding up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite succeeding in similar settings, decreases performance in the offensive and hate-speech domains compared to the default method, even with state-of-the-art augmentations such as backtranslation.
arXiv Detail & Related papers (2023-07-31T12:35:54Z)
- WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for
Wikipedia Categories [5.652290685410878]
Our research focuses on solving the zero-shot text classification problem in NLP.
We propose a novel self-training strategy that uses labels rather than text for training.
Our method achieves state-of-the-art results on both the Yahoo Topic and AG News datasets.
arXiv Detail & Related papers (2023-07-28T04:17:41Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text
Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings [61.77760317554826]
This paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE.
We evaluate the proposed InfoCSE on several benchmark datasets for the semantic textual similarity (STS) task.
Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
arXiv Detail & Related papers (2022-10-08T15:53:19Z)
- Rating Facts under Coarse-to-fine Regimes [0.533024001730262]
We collect 24K manually rated statements from PolitiFact.
Our task represents a twist from standard classification, due to the various degrees of similarity between classes.
After training, class similarity is evident across the multi-class datasets, especially in the fine-grained one.
arXiv Detail & Related papers (2021-07-13T13:05:11Z)
- Semi-Supervised Models via Data Augmentation for Classifying Interactive
Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to even out the label distributions and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)
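The SMDA entry above treats low-entropy predictions on unlabeled sentences as pseudo-labels. A minimal sketch of that selection step follows; the threshold value and function names are illustrative, not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, threshold=0.3):
    """Keep predictions whose entropy is below `threshold`, i.e. the
    model is confident; return (example index, argmax class) pairs."""
    selected = []
    for i, probs in enumerate(predictions):
        if entropy(probs) < threshold:
            best = max(range(len(probs)), key=probs.__getitem__)
            selected.append((i, best))
    return selected
```

In a full self-training loop, the selected pairs would be appended to the labeled set and the classifier retrained, repeating until no confident predictions remain.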
This list is automatically generated from the titles and abstracts of the papers on this site; its accuracy is not guaranteed.