Selective Text Augmentation with Word Roles for Low-Resource Text
Classification
- URL: http://arxiv.org/abs/2209.01560v1
- Date: Sun, 4 Sep 2022 08:13:11 GMT
- Title: Selective Text Augmentation with Word Roles for Low-Resource Text
Classification
- Authors: Biyang Guo, Songqiao Han, Hailiang Huang
- Abstract summary: Different words may play different roles in text classification, which inspires us to strategically select the proper roles for text augmentation.
In this work, we first identify the relationships between the words in a text and the text category from the perspectives of statistical correlation and semantic similarity.
We present a new augmentation technique called STA (Selective Text Augmentation) where different text-editing operations are selectively applied to words with specific roles.
- Score: 3.4806267677524896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation techniques are widely used in text classification tasks to
improve the performance of classifiers, especially in low-resource scenarios.
Most previous methods conduct text augmentation without considering the
different functionalities of the words in the text, which may generate
unsatisfactory samples. Different words may play different roles in text
classification, which inspires us to strategically select the proper roles for
text augmentation. In this work, we first identify the relationships between
the words in a text and the text category from the perspectives of statistical
correlation and semantic similarity and then utilize them to divide the words
into four roles -- Gold, Venture, Bonus, and Trivial words, which have
different functionalities for text classification. Based on these word roles,
we present a new augmentation technique called STA (Selective Text
Augmentation) where different text-editing operations are selectively applied
to words with specific roles. STA can generate diverse and relatively clean
samples, while preserving the original core semantics, and is also quite simple
to implement. Extensive experiments on 5 benchmark low-resource text
classification datasets illustrate that augmented samples produced by STA
successfully boost the performance of classification models, significantly
outperforming previous non-selective methods, including two large language
model-based techniques. Cross-dataset experiments further indicate that STA can
help the classifiers generalize better to other datasets than previous methods.
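The abstract describes two steps: score each word by its statistical correlation with the class and its semantic similarity to it, split words into the four roles, and then apply role-specific edit operations. The sketch below is a minimal illustration of that idea, not the paper's exact method: the correlation measure (fraction of a word's occurrences inside the target class), the `similarity` callback, the 0.5 thresholds, and the mapping of role names to quadrants are all assumptions.

```python
from collections import Counter

def assign_roles(docs, labels, target, similarity, corr_t=0.5, sim_t=0.5):
    """Divide words into Gold/Venture/Bonus/Trivial roles by crossing
    statistical correlation with semantic similarity to the target class.
    `similarity` maps a word to a [0, 1] score (a stand-in for an
    embedding-based measure; an assumption, not the paper's definition)."""
    in_class, total = Counter(), Counter()
    for doc, lab in zip(docs, labels):
        for w in doc.split():
            total[w] += 1
            if lab == target:
                in_class[w] += 1
    roles = {}
    for w, n in total.items():
        corr = in_class[w] / n          # share of occurrences inside the class
        sim = similarity(w)
        if corr >= corr_t and sim >= sim_t:
            roles[w] = "gold"           # class-indicating and on-topic
        elif sim >= sim_t:
            roles[w] = "venture"        # on-topic but weakly correlated
        elif corr >= corr_t:
            roles[w] = "bonus"          # correlated but semantically off-topic
        else:
            roles[w] = "trivial"        # safest to edit away
    return roles

def selective_delete(doc, roles):
    """One selective edit operation: drop only Trivial words, so the
    augmented sample keeps the original core semantics."""
    return " ".join(w for w in doc.split() if roles.get(w) != "trivial")

# Toy demo on hypothetical data:
docs = ["great fun movie", "boring dull movie"]
labels = ["pos", "neg"]
roles = assign_roles(docs, labels, "pos",
                     similarity=lambda w: 1.0 if w in {"great", "fun"} else 0.0)
```

With this toy corpus, "great" and "fun" land in Gold, "movie" (correlated with both classes, off-topic under the toy similarity) in Bonus, and "boring"/"dull" in Trivial, so `selective_delete` removes only the latter.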
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - RankAug: Augmented data ranking for text classification [0.0]
RankAug is a text-ranking approach that detects and filters out the top augmented texts.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
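The RankAug summary describes ranking augmented texts and keeping only the top ones. A minimal sketch of that filtering idea follows; the token-level Jaccard scorer and the `keep` parameter are illustrative assumptions, not the paper's ranking model.

```python
def jaccard(a, b):
    """Token-overlap similarity between two texts (a simple stand-in
    for a learned ranking score)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def rank_and_filter(original, candidates, score=jaccard, keep=2):
    """RankAug-style filtering sketch: score each augmented candidate
    against the original text and keep only the top-ranked ones."""
    ranked = sorted(candidates, key=lambda c: score(original, c), reverse=True)
    return ranked[:keep]

# Hypothetical usage: drop the augmentation that drifted off-topic.
kept = rank_and_filter("the movie was great",
                       ["the movie was good",
                        "film great",
                        "completely unrelated text"],
                       keep=2)
```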
arXiv Detail & Related papers (2023-11-08T08:47:49Z) - Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited via learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, a decent performance gain is achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z) - Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z) - To Augment or Not to Augment? A Comparative Study on Text Augmentation
Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z) - Hierarchical Heterogeneous Graph Representation Learning for Short Text
Classification [60.233529926965836]
We propose a new method called SHINE, which is based on graph neural network (GNN) for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z) - What Have Been Learned & What Should Be Learned? An Empirical Study of
How to Selectively Augment Text for Classification [0.0]
We propose STA (Selective Text Augmentation) to selectively augment the text, where the informative, class-indicating words are emphasized but the irrelevant or noisy words are diminished.
Experiments on four English and Chinese text classification benchmark datasets demonstrate that STA can substantially outperform the non-selective text augmentation methods.
arXiv Detail & Related papers (2021-09-01T04:03:11Z) - TF-CR: Weighting Embeddings for Text Classification [6.531659195805749]
We introduce a novel weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings.
Experiments on 16 classification datasets show the effectiveness of TF-CR, leading to improved performance scores over existing weighting schemes.
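The TF-CR summary says the scheme upweights words that are both frequent and exclusive to a category. One plausible reading, sketched below, multiplies a within-category term frequency by a category ratio (the share of a word's occurrences that fall in that category); this exact formula is an assumption inferred from the abstract, not the paper's definition.

```python
from collections import Counter

def tf_cr(docs, labels):
    """Sketch of a Term Frequency-Category Ratio weighting: for each
    (word, category) pair, weight = TF within the category * the share
    of the word's total occurrences belonging to that category."""
    per_cat, total, cat_tokens = {}, Counter(), Counter()
    for doc, lab in zip(docs, labels):
        for w in doc.split():
            per_cat.setdefault(lab, Counter())[w] += 1
            total[w] += 1
            cat_tokens[lab] += 1
    weights = {}
    for lab, counts in per_cat.items():
        for w, n in counts.items():
            tf = n / cat_tokens[lab]   # frequency within the category
            cr = n / total[w]          # how exclusive the word is to it
            weights[(w, lab)] = tf * cr
    return weights

# Toy demo: "a" is frequent and exclusive to category "x", so it
# receives a higher weight there than the shared word "b".
weights = tf_cr(["a a b", "b c"], ["x", "y"])
```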
arXiv Detail & Related papers (2020-12-11T19:23:28Z) - Be More with Less: Hypergraph Attention Networks for Inductive Text
Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interactions between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z) - Exploiting Class Labels to Boost Performance on Embedding-based Text
Classification [16.39344929765961]
Embeddings of different kinds have recently become the de facto standard as features used for text classification.
We introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings.
arXiv Detail & Related papers (2020-06-03T08:53:40Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.