Quantitative Stopword Generation for Sentiment Analysis via Recursive and Iterative Deletion
- URL: http://arxiv.org/abs/2209.01519v1
- Date: Sun, 4 Sep 2022 03:04:10 GMT
- Title: Quantitative Stopword Generation for Sentiment Analysis via Recursive and Iterative Deletion
- Authors: Daniel M. DiPietro
- Abstract summary: Stopwords carry little semantic information and are often removed from text data to reduce dataset size.
We present a novel approach to generate effective stopword sets for specific NLP tasks.
- Score: 2.0305676256390934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stopwords carry little semantic information and are often removed from text
data to reduce dataset size and improve machine learning model performance.
Consequently, researchers have sought to develop techniques for generating
effective stopword sets. Previous approaches have ranged from qualitative
techniques relying upon linguistic experts, to statistical approaches that
extract word importance using correlations or frequency-dependent metrics
computed on a corpus. We present a novel quantitative approach that employs
iterative and recursive feature deletion algorithms to see which words can be
deleted from a pre-trained transformer's vocabulary with the least degradation
to its performance, specifically for the task of sentiment analysis.
Empirically, stopword lists generated via this approach drastically reduce
dataset size while negligibly impacting model performance, in one such example
shrinking the corpus by 28.4% while improving the accuracy of a trained
logistic regression model by 0.25%. In another instance, the corpus was shrunk
by 63.7% with a 2.8% decrease in accuracy. These promising results indicate
that our approach can generate highly effective stopword sets for specific NLP
tasks.
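The core procedure lends itself to a compact illustration. Below is a minimal sketch of the iterative-deletion idea, substituting a bag-of-words logistic regression for the paper's pre-trained transformer; the greedy retraining loop and the `max_drop` tolerance are illustrative assumptions, not the authors' exact algorithm.

```python
# Sketch: greedily grow a stopword set by deleting words whose removal
# barely degrades (or even improves) validation accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def accuracy_without(removed, train_texts, y_train, val_texts, y_val):
    """Retrain and score with the given words excluded from the vocabulary."""
    vec = CountVectorizer(stop_words=sorted(removed) if removed else None)
    X_tr = vec.fit_transform(train_texts)
    X_va = vec.transform(val_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    return accuracy_score(y_val, clf.predict(X_va))

def iterative_deletion(candidates, train_texts, y_train, val_texts, y_val,
                       max_drop=0.001):
    """Grow a stopword set while validation accuracy stays within max_drop."""
    stopwords = set()
    best = accuracy_without(stopwords, train_texts, y_train, val_texts, y_val)
    changed = True
    while changed:
        changed = False
        for word in sorted(candidates - stopwords):
            acc = accuracy_without(stopwords | {word}, train_texts, y_train,
                                   val_texts, y_val)
            if acc >= best - max_drop:  # deleting this word is (nearly) harmless
                stopwords.add(word)
                best = max(best, acc)
                changed = True
    return stopwords
```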
Related papers
- SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training [12.745160748376794]
We propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness.
Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication.
Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps.
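A minimal sketch of the reweighting idea, assuming a simple proxy for data commonness (mean corpus frequency of a document's token n-grams); the paper's actual commonness estimator and weighting scheme may differ.

```python
# Sketch: instead of dropping duplicates, down-weight documents whose
# content is "common" across the corpus.
from collections import Counter

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sampling_weights(tokenized_docs, n=3, alpha=1.0):
    """Assign lower sampling weight to documents with frequent n-grams."""
    corpus_counts = Counter(g for doc in tokenized_docs
                            for g in set(ngrams(doc, n)))
    raw = []
    for doc in tokenized_docs:
        grams = ngrams(doc, n)
        # mean corpus frequency of this document's n-grams (commonness proxy)
        commonness = (sum(corpus_counts[g] for g in grams) / len(grams)
                      if grams else 1.0)
        raw.append(commonness ** -alpha)  # higher commonness -> lower weight
    total = sum(raw)
    return [w / total for w in raw]
```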
arXiv Detail & Related papers (2024-07-09T08:26:39Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
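A minimal sketch of what trimming does mechanically, assuming access to the BPE merge table; the `merges`, `freqs`, and threshold values are illustrative.

```python
# Sketch: subwords rarer than a frequency threshold are split back into the
# pair they were merged from, shrinking the vocabulary without changing
# what text can be encoded.
def trim_token(token, merges, freqs, min_freq):
    """Recursively replace a rare subword with its component subwords."""
    if freqs.get(token, 0) >= min_freq or token not in merges:
        return [token]
    left, right = merges[token]
    return (trim_token(left, merges, freqs, min_freq)
            + trim_token(right, merges, freqs, min_freq))

# Illustrative merge table: rare tokens decompose into their parents.
merges = {"happiness": ("happi", "ness"), "unhappiness": ("un", "happiness")}
freqs = {"unhappiness": 2, "happiness": 3, "happi": 40, "ness": 90, "un": 120}
print(trim_token("unhappiness", merges, freqs, min_freq=10))
# -> ['un', 'happi', 'ness']
```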
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Towards Faster k-Nearest-Neighbor Machine Translation [56.66038663128903]
k-nearest-neighbor machine translation approaches suffer from heavy retrieval overhead on the entire datastore when decoding each token.
We propose a simple yet effective multi-layer perceptron (MLP) network to predict whether a token should be translated jointly by the neural machine translation model and the probabilities produced by the kNN.
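A minimal sketch of the gating idea, assuming the MLP reads the decoder hidden state and a fixed interpolation weight; the dimensions, features, and `KNNGate` name are illustrative, not the paper's architecture.

```python
# Sketch: a small MLP decides per token whether the expensive kNN datastore
# lookup is worth doing; otherwise only the NMT distribution is used.
import torch
import torch.nn as nn

class KNNGate(nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, decoder_state):
        return self.net(decoder_state).squeeze(-1)  # P(use kNN)

def predict_token(decoder_state, nmt_probs, knn_lookup, gate, threshold=0.5):
    """Skip datastore retrieval when the gate predicts it will not help."""
    if gate(decoder_state).item() < threshold:
        return nmt_probs                           # cheap path: NMT only
    knn_probs = knn_lookup(decoder_state)          # expensive path: retrieve
    lam = 0.5                                      # fixed interpolation weight
    return lam * knn_probs + (1 - lam) * nmt_probs
```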
arXiv Detail & Related papers (2023-12-12T16:41:29Z)
- Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation [56.57532238195446]
We propose a method named Ethicist for targeted training data extraction.
To elicit memorization, we tune soft prompt embeddings while keeping the model fixed.
We show that Ethicist significantly improves the extraction performance on a recently proposed public benchmark.
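A minimal sketch of soft-prompt tuning with a frozen model, assuming a HuggingFace-style causal LM that accepts `inputs_embeds`; the paper's loss smoothing and calibrated confidence estimation are not reproduced here.

```python
# Sketch: only the prompt embeddings receive gradients; the LM stays frozen.
import torch
import torch.nn.functional as F

def tune_soft_prompt(model, input_ids, target_ids,
                     prompt_len=20, steps=100, lr=1e-3):
    """Optimize prompt embeddings so the frozen LM reproduces target_ids."""
    embed = model.get_input_embeddings()
    soft_prompt = torch.randn(1, prompt_len, embed.embedding_dim,
                              requires_grad=True)
    for p in model.parameters():          # freeze the language model
        p.requires_grad_(False)
    opt = torch.optim.Adam([soft_prompt], lr=lr)
    for _ in range(steps):
        inputs = torch.cat([soft_prompt, embed(input_ids),
                            embed(target_ids)], dim=1)
        logits = model(inputs_embeds=inputs).logits
        T = target_ids.size(1)
        # logits at position i predict token i+1, hence the shift by one
        loss = F.cross_entropy(logits[0, -T - 1:-1], target_ids[0])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft_prompt
```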
arXiv Detail & Related papers (2023-07-10T08:03:41Z)
- Automatic Counterfactual Augmentation for Robust Text Classification Based on Word-Group Search [12.894936637198471]
In general, a keyword is regarded as a shortcut if it creates a superficial association with the label, resulting in a false prediction.
We propose a new Word-Group mining approach, which captures the causal effect of any keyword combination and ranks the combinations by how strongly they affect the prediction.
Our approach is based on efficient post-hoc analysis and beam search, which ensures mining effectiveness while reducing complexity.
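A minimal sketch of beam search over keyword combinations, assuming an `effect` function that scores how strongly masking a group shifts the model's prediction, standing in for the paper's causal-effect estimate.

```python
# Sketch: grow keyword groups, keeping at each step the combinations whose
# masking most changes the prediction.
def beam_search_word_groups(keywords, effect, beam_width=5, max_size=3):
    """effect(group: frozenset) -> float; larger = bigger prediction shift."""
    beam = sorted((frozenset([w]) for w in keywords),
                  key=effect, reverse=True)[:beam_width]
    best = list(beam)
    for _ in range(max_size - 1):
        expanded = {g | {w} for g in beam for w in keywords if w not in g}
        beam = sorted(expanded, key=effect, reverse=True)[:beam_width]
        best.extend(beam)
    # Rank all explored groups by their estimated effect.
    return sorted(set(best), key=effect, reverse=True)
```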
arXiv Detail & Related papers (2023-07-01T02:26:34Z)
- Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
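One way to express the decreasing value of repeated tokens is an exponential decay in "effective" data; the form below mirrors that idea, but `r_star` is an arbitrary illustrative constant, not the paper's fitted value.

```python
# Sketch: each extra epoch over the same tokens contributes less effective
# data, decaying exponentially with the number of repetitions.
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective data after training `epochs` times over `unique_tokens`."""
    repetitions = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-repetitions / r_star)))

# Up to ~4 epochs, repeated data behaves nearly like unique data:
for e in (1, 2, 4, 16, 64):
    print(e, round(effective_tokens(1.0, e), 2))
```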
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
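A minimal sketch of a feature-density style measure, assuming the common formulation of unique features divided by total feature occurrences; the paper's exact definition may differ.

```python
# Sketch: a simple dataset-complexity proxy from the feature distribution.
def feature_density(documents, featurize=str.split):
    """Ratio of unique features to all feature occurrences in the dataset."""
    feats = [f for doc in documents for f in featurize(doc)]
    return len(set(feats)) / len(feats) if feats else 0.0

docs = ["the cat sat", "the dog sat", "the cat ran"]
print(feature_density(docs))  # 5 unique features / 9 occurrences ≈ 0.56
```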
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification [2.7261840344953807]
This study examines how preprocessing techniques affect the vocabulary size, model performance, and model run-time.
We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of the accuracy for up to a 65% reduction of run-time.
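A minimal sketch of this kind of sweep, assuming a bag-of-words logistic regression and three illustrative preprocessing steps; vocabulary size stands in as a proxy for run-time.

```python
# Sketch: evaluate every combination of preprocessing steps, recording
# vocabulary size alongside accuracy.
import itertools
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

STEPS = {
    "lowercase": str.lower,
    "strip_digits": lambda s: re.sub(r"\d+", " ", s),
    "strip_punct": lambda s: re.sub(r"[^\w\s]", " ", s),
}

def apply_steps(text, combo):
    for name in combo:
        text = STEPS[name](text)
    return text

def evaluate(combo, train_texts, y_train, test_texts, y_test):
    """Return (vocabulary size, accuracy) for one preprocessing combination."""
    vec = CountVectorizer(lowercase=False)
    X_tr = vec.fit_transform([apply_steps(t, combo) for t in train_texts])
    X_te = vec.transform([apply_steps(t, combo) for t in test_texts])
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    return len(vec.vocabulary_), accuracy_score(y_test, clf.predict(X_te))

# Sweep every subset of steps and compare vocabulary size vs. accuracy:
# for r in range(len(STEPS) + 1):
#     for combo in itertools.combinations(STEPS, r):
#         print(combo, evaluate(combo, train_texts, y_train,
#                               test_texts, y_test))
```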
arXiv Detail & Related papers (2021-04-08T15:49:59Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific stop words, called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms a mutual information baseline.
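A minimal sketch of the hyperplane idea, assuming that words whose weights contribute least to a linear classifier's separating hyperplane make good stop-word candidates; the TF-IDF features and cutoff fraction are illustrative choices.

```python
# Sketch: fit a linear classifier and treat the words with the smallest
# hyperplane weights as domain-specific stop words.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def hyperplane_stopwords(texts, labels, drop_frac=0.9):
    """Return the fraction of words contributing least to the hyperplane."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    clf = LinearSVC().fit(X, labels)
    weights = np.abs(clf.coef_).max(axis=0)   # per-word importance
    order = np.argsort(weights)               # least important first
    n_drop = int(drop_frac * len(order))
    vocab = np.array(vec.get_feature_names_out())
    return set(vocab[order[:n_drop]])
```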
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Word Embeddings: Stability and Semantic Change [0.0]
We present an experimental study on the instability of the training process of three of the most influential embedding techniques of the last decade: word2vec, GloVe and fastText.
We propose a statistical model to describe the instability of embedding techniques and introduce a novel metric to measure the instability of the representation of an individual word.
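A minimal sketch of one way to quantify per-word instability, comparing a word's nearest neighbors across independently trained embedding runs (at least two); the overlap-based metric is an illustrative choice, not necessarily the paper's statistic.

```python
# Sketch: a word is unstable if its nearest neighbors change between
# retrainings of the same embedding model.
import numpy as np

def nearest_neighbors(word, vocab, vectors, k=10):
    """Set of the k most cosine-similar words to `word`."""
    i = vocab.index(word)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v[i]
    sims[i] = -np.inf                    # exclude the word itself
    return {vocab[j] for j in np.argsort(sims)[-k:]}

def instability(word, runs, k=10):
    """1 - mean pairwise neighbor overlap across (vocab, vectors) runs."""
    neigh = [nearest_neighbors(word, vocab, vecs, k) for vocab, vecs in runs]
    overlaps = [len(a & b) / k for i, a in enumerate(neigh)
                for b in neigh[i + 1:]]
    return 1.0 - sum(overlaps) / len(overlaps)
```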
arXiv Detail & Related papers (2020-07-23T16:03:50Z)