Identifying Semantically Difficult Samples to Improve Text
Classification
- URL: http://arxiv.org/abs/2302.06155v1
- Date: Mon, 13 Feb 2023 07:33:46 GMT
- Title: Identifying Semantically Difficult Samples to Improve Text
Classification
- Authors: Shashank Mujumdar, Stuti Mehta, Hima Patel, Suman Mitra
- Abstract summary: We investigate the effect of addressing difficult samples from a given text dataset on the downstream text classification task.
We define difficult samples as being non-obvious cases for text classification by analysing them in the semantic embedding space.
We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9%.
- Score: 4.545971444299925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the effect of addressing difficult samples from
a given text dataset on the downstream text classification task. We define
difficult samples as being non-obvious cases for text classification by
analysing them in the semantic embedding space; specifically - (i) semantically
similar samples that belong to different classes and (ii) semantically
dissimilar samples that belong to the same class. We propose a penalty function
to measure the overall difficulty score of every sample in the dataset. We
conduct exhaustive experiments on 13 standard datasets to show a consistent
improvement of up to 9% and discuss qualitative results to show effectiveness
of our approach in identifying difficult samples for a text classification
model.
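The abstract does not reproduce the penalty function itself; the following minimal sketch (our illustration, not the authors' code) shows how such a per-sample difficulty score could be computed from sentence embeddings, using cosine similarity over k nearest neighbours and penalising exactly the two non-obvious cases named above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def difficulty_scores(embeddings, labels, k=10):
    """Hypothetical per-sample difficulty score (not the paper's exact
    penalty function). A sample is penalised when its k nearest
    neighbours in the embedding space (i) are close but carry a
    different label, or (ii) share its label but are far away."""
    labels = np.asarray(labels)
    sim = cosine_similarity(embeddings)        # (n, n) pairwise similarities
    np.fill_diagonal(sim, -np.inf)             # ignore self-matches
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        nn = np.argsort(sim[i])[-k:]           # k nearest neighbours of sample i
        same = labels[nn] == labels[i]
        # (i) semantically similar but different class
        scores[i] += sim[i, nn][~same].clip(min=0).sum()
        # (ii) same class but semantically dissimilar
        scores[i] += (1 - sim[i, nn][same]).sum()
    return scores
```

Samples ranked highest by such a score could then be inspected, relabelled, reweighted, or removed before training the downstream classifier.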
Related papers
- Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters.
We propose a contrastive learning-based STR framework by leveraging synthetic and real unlabeled data without any human cost.
Our method achieves SOTA performance (94.7% average accuracy on common benchmarks and 70.9% on Union14M-Benchmark).
arXiv Detail & Related papers (2024-11-23T15:24:47Z)
- Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution [1.3654846342364308]
State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce.
We propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task.
We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection.
arXiv Detail & Related papers (2024-05-09T12:03:38Z)
- Differences Between Hard and Noisy-labeled Samples: An Empirical Study [7.132368785057315]
Distinguishing noisy or incorrectly labeled samples from hard/difficult samples in a labeled dataset is an important yet under-explored topic.
We introduce a simple yet effective metric that filters out noisy-labeled samples while keeping the hard samples.
Our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
arXiv Detail & Related papers (2023-07-20T09:24:23Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up.
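A minimal sketch of what such instance-specific label smoothing could look like (our illustration; the paper's exact interpolation and mixup schedule may differ):

```python
import torch
import torch.nn.functional as F

def instance_smoothed_labels(logits, targets, alpha=0.1):
    """Illustrative instance-specific label smoothing: each sample's
    one-hot label is linearly interpolated with the model's own
    predicted distribution for that sample, yielding a per-instance
    soft label rather than a uniform smoothing term."""
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    probs = logits.softmax(dim=-1).detach()       # model's current belief
    return (1 - alpha) * one_hot + alpha * probs  # instance-specific soft label

# These soft labels can then replace one-hot targets in standard mixup:
# mixed_y = lam * soft_y[idx_a] + (1 - lam) * soft_y[idx_b]
```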
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
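The summary does not spell out the weighting function; as a rough illustration, SoftMatch-style weighting replaces a hard confidence threshold with a continuous, truncated-Gaussian weight over each pseudo-label's confidence. In the sketch below, mu and sigma are fixed for simplicity, whereas the paper estimates them on the fly:

```python
import torch

def softmatch_weight(probs, mu, sigma, lambda_max=1.0):
    """Soft confidence weighting in the spirit of SoftMatch: every
    pseudo-label contributes, with a weight that decays smoothly as
    its confidence falls below the (running) mean mu."""
    conf = probs.max(dim=-1).values  # confidence of each pseudo-label
    weight = lambda_max * torch.exp(-((conf - mu) ** 2) / (2 * sigma ** 2))
    return torch.where(conf >= mu, torch.full_like(conf, lambda_max), weight)
```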
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
- Text sampling strategies for predicting missing bibliographic links [0.0]
The paper proposes various strategies for sampling text when performing automatic sentence classification.
We examine a number of sampling strategies that differ in context size and position.
This method of detecting missing links can be used in recommendation engines of applied intelligent information systems.
arXiv Detail & Related papers (2023-01-04T15:53:50Z)
- Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution.
Recent self-supervised learning methods have been shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences.
We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z)
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples [14.881597737762316]
We propose a novel statistic -- noise score -- as a measure for the quality of each data point to identify mislabeled samples.
In our work, we use the representations derived by the inference network of a data-quality-supervised variational autoencoder (AQUAVS).
We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets.
arXiv Detail & Related papers (2021-09-10T17:14:09Z)
- Constructing Contrastive samples via Summarization for Text Classification with limited annotations [46.53641181501143]
We propose a novel approach to constructing contrastive samples for language tasks using text summarization.
We use these samples for supervised contrastive learning to gain better text representations with limited annotations.
Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News) demonstrate the effectiveness of the proposed contrastive learning framework.
arXiv Detail & Related papers (2021-04-11T20:13:24Z)
- Debiased Contrastive Learning [64.98602526764599]
We develop a debiased contrastive objective that corrects for the sampling of same-label datapoints.
Empirically, the proposed objective consistently outperforms the state-of-the-art for representation learning in vision, language, and reinforcement learning benchmarks.
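As an illustration of the correction (a sketch following the paper's published formulation, with one positive per anchor; variable names are ours), the negative term of the InfoNCE denominator is debiased by the class prior tau_plus, the probability that a sampled "negative" actually shares the anchor's label:

```python
import math
import torch

def debiased_contrastive_loss(pos_sim, neg_sim, tau_plus=0.1, t=0.5):
    """Sketch of a debiased contrastive objective.
    pos_sim: (B,) similarity of each anchor to its positive.
    neg_sim: (B, N) similarities of each anchor to N sampled negatives."""
    pos = torch.exp(pos_sim / t)
    neg = torch.exp(neg_sim / t)
    n_neg = neg.size(1)
    # Correct the negatives' contribution for accidental same-class pairs.
    ng = (neg.mean(dim=1) - tau_plus * pos) / (1 - tau_plus)
    ng = torch.clamp(ng, min=math.exp(-1 / t))  # theoretical lower bound
    return -torch.log(pos / (pos + n_neg * ng)).mean()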
arXiv Detail & Related papers (2020-07-01T04:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.