Exploring the Relationship Between Algorithm Performance, Vocabulary,
and Run-Time in Text Classification
- URL: http://arxiv.org/abs/2104.03848v1
- Date: Thu, 8 Apr 2021 15:49:59 GMT
- Authors: Wilson Fearn, Orion Weller, Kevin Seppi
- Abstract summary: This study examines how preprocessing techniques affect the vocabulary size, model performance, and model run-time.
We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of the accuracy for up to a 65% reduction of run-time.
- Score: 2.7261840344953807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text classification is a significant branch of natural language processing,
and has many applications including document classification and sentiment
analysis. Unsurprisingly, those who do text classification are concerned with
the run-time of their algorithms, many of which depend on the size of the
corpus' vocabulary due to their bag-of-words representation. Although many
studies have examined the effect of preprocessing techniques on vocabulary size
and accuracy, none have examined how these methods affect a model's run-time.
To fill this gap, we provide a comprehensive study that examines how
preprocessing techniques affect the vocabulary size, model performance, and
model run-time, evaluating ten techniques over four models and two datasets. We
show that some individual methods can reduce run-time with no loss of accuracy,
while some combinations of methods can trade 2-5% of the accuracy for up to a
65% reduction of run-time. Furthermore, some combinations of preprocessing
techniques can even provide a 15% reduction in run-time while simultaneously
improving model accuracy.
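Since bag-of-words models index every distinct token, the run-time savings described above follow directly from vocabulary reduction. The following is a minimal illustrative sketch (not the paper's code) of how two common preprocessing steps, lowercasing and stopword removal, shrink the vocabulary a bag-of-words model must index; the tiny corpus and stopword list are toy assumptions, not the ten techniques or datasets the paper evaluates.

```python
import re
from collections import Counter

# Toy stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "and", "of"}

def tokenize(text):
    """Split text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def vocabulary(docs, lowercase=False, drop_stopwords=False):
    """Collect the distinct tokens a bag-of-words model would index."""
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        if lowercase:
            tokens = [t.lower() for t in tokens]
        if drop_stopwords:
            tokens = [t for t in tokens if t.lower() not in STOPWORDS]
        vocab.update(tokens)
    return vocab

corpus = [
    "The movie is great and the plot is gripping",
    "A great plot, and the acting is great",
]

raw = vocabulary(corpus)
cleaned = vocabulary(corpus, lowercase=True, drop_stopwords=True)
print(len(raw), len(cleaned))  # the cleaned vocabulary is strictly smaller
```

Because the bag-of-words feature dimension equals the vocabulary size, any such reduction directly shrinks the model's input representation, which is the mechanism behind the run-time savings the study measures.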
Related papers
- Ensembling Finetuned Language Models for Text Classification [55.15643209328513]
Finetuning is a common practice across different communities to adapt pretrained models to particular tasks.
Ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates.
We present a metadataset with predictions from five large finetuned models on six datasets and report results of different ensembling strategies.
arXiv Detail & Related papers (2024-10-25T09:15:54Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z) - Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification [4.498100922387482]
Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient.
Previous results demonstrated that these methods can even improve performance on some classification tasks.
This paper investigates how these techniques influence the classification performance and computation costs compared to full fine-tuning.
arXiv Detail & Related papers (2023-08-14T17:12:43Z) - Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer
with Fine-tuning Slow and Fast [50.19681990847589]
Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages.
This paper analyzes the fine-tuning process, discovers when the performance gap changes and identifies which network weights affect the overall performance most.
arXiv Detail & Related papers (2023-05-19T06:04:21Z) - Quantitative Stopword Generation for Sentiment Analysis via Recursive
and Iterative Deletion [2.0305676256390934]
Stopwords carry little semantic information and are often removed from text data to reduce dataset size.
We present a novel approach to generate effective stopword sets for specific NLP tasks.
arXiv Detail & Related papers (2022-09-04T03:04:10Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Word Embeddings: Stability and Semantic Change [0.0]
We present an experimental study on the instability of the training process of three of the most influential embedding techniques of the last decade: word2vec, GloVe and fastText.
We propose a statistical model to describe the instability of embedding techniques and introduce a novel metric to measure the instability of the representation of an individual word.
arXiv Detail & Related papers (2020-07-23T16:03:50Z) - Deep learning models for representing out-of-vocabulary words [1.4502611532302039]
We present a performance evaluation of deep learning models for representing out-of-vocabulary (OOV) words.
Although the best technique for handling OOV words is different for each task, Comick, a deep learning method that infers the embedding based on the context and the morphological structure of the OOV word, obtained promising results.
arXiv Detail & Related papers (2020-07-14T19:31:25Z) - FastWordBug: A Fast Method To Generate Adversarial Text Against NLP
Applications [0.5524804393257919]
We present a novel algorithm, FastWordBug, to efficiently generate small text perturbations in a black-box setting.
We evaluate FastWordBug on three real-world text datasets and two state-of-the-art machine learning models under black-box setting.
arXiv Detail & Related papers (2020-01-31T07:39:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.