RankAug: Augmented data ranking for text classification
- URL: http://arxiv.org/abs/2311.04535v1
- Date: Wed, 8 Nov 2023 08:47:49 GMT
- Title: RankAug: Augmented data ranking for text classification
- Authors: Tiasa Singha Roy and Priyam Basu
- Abstract summary: RankAug is a text-ranking approach that identifies and keeps the top-ranked augmented texts.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Research on data generation and augmentation has focused mainly on
improving generation models, leaving a notable gap in the exploration and
refinement of methods for evaluating synthetic data. Several text similarity
metrics can be used to filter generated data, and the choice among them can
affect performance on specific Natural Language Understanding (NLU) tasks; this
study focuses on intent and sentiment classification. We propose RankAug, a
text-ranking approach that identifies and keeps the top augmented texts, namely
those most similar in meaning to the original while remaining lexically and
syntactically diverse. Through experiments conducted on multiple datasets, we
demonstrate that the judicious selection of filtering techniques can yield a
substantial improvement of up to 35% in classification accuracy for
under-represented classes.
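The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two scoring functions below (bag-of-words cosine as a stand-in for a semantic similarity metric, and one minus Jaccard overlap as a stand-in for lexical diversity), the product used to combine them, and the names `cosine_bow`, `lexical_diversity`, and `rank_augmentations` are all assumptions chosen for illustration. A real pipeline would likely plug in an embedding-based similarity through the `similarity` parameter.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (a deliberately crude assumption)."""
    return text.lower().split()


def cosine_bow(a, b):
    """Bag-of-words cosine similarity; a toy stand-in for a semantic metric."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_diversity(a, b):
    """1 - Jaccard overlap of token sets: 0 for identical vocabulary, 1 for disjoint."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def rank_augmentations(source, candidates, top_k=2, similarity=cosine_bow):
    """Score each candidate by similarity * diversity and keep the top_k."""
    scored = [
        (similarity(source, c) * lexical_diversity(source, c), c)
        for c in candidates
    ]
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


source = "book a flight to paris"
candidates = [
    "book a flight to paris",            # exact copy: zero diversity
    "reserve a flight to paris please",  # paraphrase with some new words
    "what is the weather today",         # unrelated: zero similarity
]
kept = rank_augmentations(source, candidates, top_k=1)
print(kept)
```

With these toy metrics, both the exact copy and the unrelated sentence score zero, so only the paraphrase survives the filter. The product is one simple way to reward texts that are close in meaning yet worded differently; a weighted sum or separate thresholds would be equally plausible choices.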
Related papers
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- Classification and Clustering of Sentence-Level Embeddings of Scientific Articles Generated by Contrastive Learning [1.104960878651584]
Our approach consists of fine-tuning transformer language models to generate sentence-level embeddings from scientific articles.
We trained our models on three datasets with contrastive learning.
We show that fine-tuning sentence transformers with contrastive learning and using the generated embeddings in downstream tasks is a feasible approach to sentence classification in scientific articles.
arXiv Detail & Related papers (2024-03-30T02:52:14Z)
- Selective Text Augmentation with Word Roles for Low-Resource Text Classification [3.4806267677524896]
Different words may play different roles in text classification, which inspires us to strategically select the proper roles for text augmentation.
In this work, we first identify the relationships between the words in a text and the text category from the perspectives of statistical correlation and semantic similarity.
We present a new augmentation technique called STA (Selective Text Augmentation) where different text-editing operations are selectively applied to words with specific roles.
arXiv Detail & Related papers (2022-09-04T08:13:11Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 on a handful of labeled instances leads to consistent classification improvements.
arXiv Detail & Related papers (2021-11-17T12:10:03Z)
- What Have Been Learned & What Should Be Learned? An Empirical Study of How to Selectively Augment Text for Classification [0.0]
We propose STA (Selective Text Augmentation) to selectively augment the text, where the informative, class-indicating words are emphasized but the irrelevant or noisy words are diminished.
Experiments on four English and Chinese text classification benchmark datasets demonstrate that STA can substantially outperform the non-selective text augmentation methods.
arXiv Detail & Related papers (2021-09-01T04:03:11Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- GenAug: Data Augmentation for Finetuning Text Generators [21.96895115572357]
We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews.
Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods.
arXiv Detail & Related papers (2020-10-05T05:46:39Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
- Deep Learning feature selection to unhide demographic recommender systems factors [63.732639864601914]
The matrix factorization model generates factors which do not incorporate semantic knowledge.
DeepUnHide is able to extract demographic information from the users and items factors in collaborative filtering recommender systems.
arXiv Detail & Related papers (2020-06-17T17:36:48Z)