Cost-Sensitive BERT for Generalisable Sentence Classification with
Imbalanced Data
- URL: http://arxiv.org/abs/2003.11563v1
- Date: Mon, 16 Mar 2020 19:10:57 GMT
- Authors: Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle
- Abstract summary: We show that BERT does not generalise well when the training and test data are sufficiently dissimilar.
We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT.
We achieve the second-highest score on sentence-level propaganda classification.
- Score: 5.08128537391027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automatic identification of propaganda has gained significance in recent
years due to technological and social changes in the way news is generated and
consumed. That this task can be addressed effectively using BERT, a powerful
new architecture which can be fine-tuned for text classification tasks, is not
surprising. However, propaganda detection, like other tasks that deal with news
documents and other forms of decontextualized social communication (e.g.
sentiment analysis), inherently deals with data whose categories are
simultaneously imbalanced and dissimilar. We show that BERT, while capable of
handling imbalanced classes with no additional data augmentation, does not
generalise well when the training and test data are sufficiently dissimilar (as
is often the case with news sources, whose topics evolve over time). We show
how to address this problem by providing a statistical measure of similarity
between datasets and a method of incorporating cost-weighting into BERT when
the training and test sets are dissimilar. We test these methods on the
Propaganda Techniques Corpus (PTC) and achieve the second-highest score on
sentence-level propaganda classification.
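The abstract names two ingredients: a statistical measure of similarity between training and test datasets, and cost-weighting in the classification loss. The abstract does not specify which statistic or weighting scheme the paper uses, so the following is only an illustrative sketch: Jensen-Shannon divergence between unigram word distributions as a dataset-similarity proxy, and inverse-frequency class weights applied to cross-entropy. All function names here are hypothetical.

```python
import math
from collections import Counter

def js_divergence(corpus_a, corpus_b):
    """Jensen-Shannon divergence between the unigram word distributions
    of two corpora (0 = identical, log 2 = fully disjoint vocabularies)."""
    ca, cb = Counter(), Counter()
    for sent in corpus_a:
        ca.update(sent.lower().split())
    for sent in corpus_b:
        cb.update(sent.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    div = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2
        if p:
            div += 0.5 * p * math.log(p / m)
        if q:
            div += 0.5 * q * math.log(q / m)
    return div

def class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, label, weights):
    """Cost-weighted cross-entropy for one example; `probs` maps each
    class to the model's softmax probability for it."""
    return -weights[label] * math.log(probs[label])
```

In practice the per-class weights would be passed to the loss of a fine-tuned BERT classifier (e.g. a weighted cross-entropy over the logits), so that errors on the minority class cost more during training.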
Related papers
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using
Genre Classification [0.27195102129095]
We show that classification tasks still suffer from a performance gap when the underlying distribution of topics changes.
We quantify this phenomenon empirically with a large corpus and a large set of topics.
We suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50% for some topics.
arXiv Detail & Related papers (2023-11-27T18:53:31Z)
- JointMatch: A Unified Approach for Diverse and Collaborative
Pseudo-Labeling to Semi-Supervised Text Classification [65.268245109828]
Semi-supervised text classification (SSTC) has gained increasing attention due to its ability to leverage unlabeled data.
Existing approaches based on pseudo-labeling suffer from the issues of pseudo-label bias and error accumulation.
We propose JointMatch, a holistic approach for SSTC that addresses these challenges by unifying ideas from recent semi-supervised learning.
arXiv Detail & Related papers (2023-10-23T05:43:35Z)
- Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News
Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets a new state of the art for few-shot fake news detection by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z)
- Noisy Self-Training with Data Augmentations for Offensive and Hate
Speech Detection Tasks [3.703767478524629]
"Noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against adversarial attacks.
We run experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, yielding up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite succeeding in similar settings, decreases performance in the offensive and hate-speech domains compared to the default method, even with state-of-the-art augmentations such as backtranslation.
arXiv Detail & Related papers (2023-07-31T12:35:54Z)
- WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for
Wikipedia Categories [5.652290685410878]
Our research focuses on solving the zero-shot text classification problem in NLP.
We propose a novel self-training strategy that uses labels rather than text for training.
Our method achieves state-of-the-art results on both the Yahoo Topic and AG News datasets.
arXiv Detail & Related papers (2023-07-28T04:17:41Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text
Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings [61.77760317554826]
This paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE.
We evaluate the proposed InfoCSE on several benchmark datasets for the semantic textual similarity (STS) task.
Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
arXiv Detail & Related papers (2022-10-08T15:53:19Z)
- Rating Facts under Coarse-to-fine Regimes [0.533024001730262]
We collect 24K manually rated statements from PolitiFact.
Our task represents a twist from standard classification, due to the various degrees of similarity between classes.
After training, class similarity is evident across the multi-class datasets, especially in the fine-grained one.
arXiv Detail & Related papers (2021-07-13T13:05:11Z)
- Semi-Supervised Models via Data Augmentation for Classifying Interactive
Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to even out the label distributions and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
arXiv Detail & Related papers (2020-04-23T05:02:31Z)
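The SMDA entry above treats low-entropy predictions on unlabeled sentences as pseudo-labels. A minimal sketch of that selection step follows; the threshold value and function names are illustrative, not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, threshold=0.3):
    """Keep predictions whose entropy is below `threshold`, i.e. the
    model is confident; return (example index, argmax class) pairs."""
    selected = []
    for i, probs in enumerate(predictions):
        if entropy(probs) < threshold:
            best = max(range(len(probs)), key=probs.__getitem__)
            selected.append((i, best))
    return selected
```

In a full self-training loop, the selected pairs would be appended to the labeled set and the classifier retrained, repeating until no confident predictions remain.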
This list is automatically generated from the titles and abstracts of the papers on this site; its accuracy is not guaranteed.