What Have Been Learned & What Should Be Learned? An Empirical Study of
How to Selectively Augment Text for Classification
- URL: http://arxiv.org/abs/2109.00175v1
- Date: Wed, 1 Sep 2021 04:03:11 GMT
- Title: What Have Been Learned & What Should Be Learned? An Empirical Study of
How to Selectively Augment Text for Classification
- Authors: Biyang Guo, Songqiao Han, Hailiang Huang
- Abstract summary: We propose STA (Selective Text Augmentation) to selectively augment the text, where the informative, class-indicating words are emphasized but the irrelevant or noisy words are diminished.
Experiments on four English and Chinese text classification benchmark datasets demonstrate that STA can substantially outperform the non-selective text augmentation methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text augmentation techniques are widely used in text classification problems
to improve the performance of classifiers, especially in low-resource scenarios.
While many creative text augmentation methods have been designed, they augment
the text in a non-selective manner: less important or noisy words have the same
chance of being augmented as informative words, which limits the performance of
augmentation. In this work, we systematically summarize three kinds of role
keywords, which serve different functions for text classification, and design
effective methods to extract them from the text. Based on these extracted role
keywords, we propose STA (Selective Text Augmentation) to selectively augment
the text, where the informative, class-indicating words are emphasized while the
irrelevant or noisy words are diminished. Extensive experiments on four English
and Chinese text classification benchmark datasets demonstrate that STA
substantially outperforms non-selective text augmentation methods.
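To make the selective idea concrete, below is a minimal sketch of one STA-style operation, selective deletion, in Python. The keyword sets and the `selective_deletion` helper are hypothetical illustrations, not the authors' implementation; in STA the role keywords are extracted from the corpus rather than hand-listed.

```python
import random

# Hypothetical role keywords for a class such as "sports"; in STA these
# would be extracted automatically, not hand-listed.
CLASS_INDICATING = {"match", "team", "score", "league"}
NOISY_OR_IRRELEVANT = {"really", "just", "thing", "stuff"}

def selective_deletion(text, p_delete=0.3):
    """Randomly delete words, but never drop class-indicating words and
    always drop known noisy words, so informative words are emphasized
    while irrelevant ones are diminished."""
    kept = []
    for word in text.split():
        w = word.lower()
        if w in CLASS_INDICATING:         # always keep informative words
            kept.append(word)
        elif w in NOISY_OR_IRRELEVANT:    # always remove noisy words
            continue
        elif random.random() > p_delete:  # keep other words with prob. 1 - p
            kept.append(word)
    return " ".join(kept)

print(selective_deletion("The team really played a great match today"))
```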
Related papers
- Evaluating Text Classification Robustness to Part-of-Speech Adversarial Examples [0.6445605125467574]
Adversarial examples are inputs designed to trick a model's decision-making process, and are intended to be imperceptible to humans.
For text-based classification systems, changes to the input, a string of text, are always perceptible.
To improve the quality of text-based adversarial examples, we need to know what elements of the input text are worth focusing on.
arXiv Detail & Related papers (2024-08-15T18:33:54Z)
- Enhancing Multilingual Voice Toxicity Detection with Speech-Text Alignment [4.2936749846785345]
Toxicity classification for voice relies heavily on the semantic content of speech.
We propose a novel framework that utilizes cross-modal learning to integrate the semantic embedding of text into a multilabel speech toxicity classifier.
arXiv Detail & Related papers (2024-06-14T17:56:53Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- RankAug: Augmented data ranking for text classification [0.0]
RankAug is a text-ranking approach that detects and filters out the top augmented texts; a toy sketch of ranking-based filtering follows this entry.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
arXiv Detail & Related papers (2023-11-08T08:47:49Z)
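As a toy illustration of ranking-based filtering of augmented candidates (not RankAug's actual ranking model), the sketch below scores each candidate by word-overlap similarity with the original text and keeps the top-k; the Jaccard scorer is a stand-in assumption.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def filter_augmented(original, candidates, k=2):
    """Rank augmented candidates by similarity to the original text and
    keep the top-k; a stand-in for a learned ranking model."""
    ranked = sorted(candidates, key=lambda c: jaccard(original, c), reverse=True)
    return ranked[:k]

original = "the service at this restaurant was excellent"
candidates = [
    "the service at this place was excellent",
    "service at this restaurant was great",
    "completely unrelated sentence about weather",
]
print(filter_augmented(original, candidates))
```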
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair; a minimal sketch of this feature follows this entry.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
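A minimal sketch of the element-wise Manhattan distance feature, assuming sentence embeddings are already available; the toy vectors below are placeholders for the paper's empirical text representations.

```python
# Element-wise Manhattan (L1) distance between two sentence embeddings;
# the resulting vector, not its scalar sum, serves as the feature for an
# entailment classifier.
def manhattan_feature(text_vec, hyp_vec):
    return [abs(t - h) for t, h in zip(text_vec, hyp_vec)]

# Toy placeholder embeddings; in practice these come from a sentence encoder.
text_vec = [0.2, 0.7, 0.1, 0.9]
hyp_vec = [0.1, 0.6, 0.3, 0.9]
print(manhattan_feature(text_vec, hyp_vec))  # approximately [0.1, 0.1, 0.2, 0.0]
```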
- Selective Text Augmentation with Word Roles for Low-Resource Text Classification [3.4806267677524896]
Different words may play different roles in text classification, which inspires us to strategically select the proper roles for text augmentation.
In this work, we first identify the relationships between the words in a text and the text category from the perspectives of statistical correlation and semantic similarity; a minimal sketch of this role identification follows this entry.
We present a new augmentation technique called STA (Selective Text Augmentation) where different text-editing operations are selectively applied to words with specific roles.
arXiv Detail & Related papers (2022-09-04T08:13:11Z)
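A minimal sketch of the two perspectives named above; the thresholding rule, the correlation score, and the toy vectors are assumptions for illustration, not the paper's exact criteria.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def word_role(correlation, similarity, threshold=0.5):
    """Map the two scores to a coarse role (toy thresholding; the
    paper's criteria are more refined)."""
    if correlation >= threshold and similarity >= threshold:
        return "class-indicating"
    if correlation >= threshold or similarity >= threshold:
        return "partially relevant"
    return "irrelevant or noisy"

# Toy word and class-name embeddings (placeholders for real vectors).
word_vec, class_vec = [0.9, 0.2, 0.4], [0.8, 0.3, 0.5]
similarity = cosine(word_vec, class_vec)
correlation = 0.7  # e.g. a normalized class co-occurrence statistic
print(word_role(correlation, similarity))  # class-indicating
```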
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies that modify the syntax of the text.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification [68.3291372168167]
We focus on incorporating external knowledge into the verbalizer, forming knowledgeable prompt-tuning (KPT); a toy sketch of the verbalizer-expansion idea follows this entry.
We expand the label word space of the verbalizer using external knowledge bases (KBs) and refine the expanded space with the PLM itself before prediction.
Experiments on zero and few-shot text classification tasks demonstrate the effectiveness of knowledgeable prompt-tuning.
arXiv Detail & Related papers (2021-08-04T13:00:16Z)
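A toy sketch of the verbalizer-expansion idea: each class's label word is widened with related words from a stand-in knowledge base, and class scores average the mask-position probabilities over the expanded set. The `KB` dict and `mask_probs` table are placeholders; in KPT the probabilities come from a pretrained language model.

```python
# Stand-in knowledge base: related words for each original label word.
KB = {
    "sports":   ["football", "basketball", "athlete"],
    "politics": ["election", "government", "senate"],
}

# Placeholder mask-position probabilities; in KPT these come from a PLM
# scoring a prompt such as "A [MASK] news: {text}".
mask_probs = {
    "sports": 0.05, "football": 0.30, "basketball": 0.10, "athlete": 0.08,
    "politics": 0.02, "election": 0.04, "government": 0.03, "senate": 0.01,
}

def classify(mask_probs, kb):
    """Score each class by averaging probabilities over its expanded
    label-word set, then pick the best class."""
    scores = {
        label: sum(mask_probs.get(w, 0.0) for w in [label] + related)
               / (1 + len(related))
        for label, related in kb.items()
    }
    return max(scores, key=scores.get)

print(classify(mask_probs, KB))  # sports
```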
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on text classification.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) -- which obtains more expressive power with less computational cost for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- Improving Disentangled Text Representation Learning with Information-Theoretic Guidance [99.68851329919858]
The discrete nature of natural language makes disentangling textual representations more challenging.
Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text.
Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)