Lex2Sent: A bagging approach to unsupervised sentiment analysis
- URL: http://arxiv.org/abs/2209.13023v2
- Date: Tue, 22 Oct 2024 15:18:55 GMT
- Title: Lex2Sent: A bagging approach to unsupervised sentiment analysis
- Authors: Kai-Robin Lange, Jonas Rieger, Carsten Jentsch
- Abstract summary: In this paper, we propose an alternative approach to classifying texts: Lex2Sent.
To classify texts, we train embedding models to determine the distances between document embeddings and the embeddings of a suitable lexicon.
We show that our model outperforms lexica and provides a basis for a high-performing few-shot fine-tuning approach to binary sentiment analysis.
- Abstract: Unsupervised text classification, with its most common form being sentiment analysis, used to be performed by counting the words of a text that appear in a lexicon, which assigns each word to one class or marks it as neutral. In recent years, these lexicon-based methods fell out of favor and were replaced by computationally demanding fine-tuning techniques for encoder-only models such as BERT and by zero-shot classification using decoder-only models such as GPT-4. In this paper, we propose an alternative approach: Lex2Sent, which improves on classic lexicon methods but does not require any GPU or external hardware. To classify texts, we train embedding models to determine the distances between document embeddings and the embeddings of the parts of a suitable lexicon. We employ resampling, which results in a bagging effect that boosts classification performance. We show that our model outperforms lexica and provides a basis for a high-performing few-shot fine-tuning approach to binary sentiment analysis.
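To make the pipeline concrete, here is a minimal sketch of the Lex2Sent idea in Python. It is a hedged illustration, not the paper's reference implementation: gensim's Doc2Vec stands in for "an embedding model", and the function name `lex2sent_scores`, the variables `docs`, `pos_lex`, `neg_lex`, and the exact resampling scheme are illustrative assumptions.

```python
# Sketch of Lex2Sent (assumptions: gensim's Doc2Vec as the embedding model;
# `docs` is a list of token lists; `pos_lex`/`neg_lex` are token lists holding
# the positive/negative halves of a sentiment lexicon).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def lex2sent_scores(docs, pos_lex, neg_lex, n_resamples=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(docs))
    for r in range(n_resamples):
        # Resample each document's tokens with replacement; training one
        # embedding model per resampled corpus produces the bagging effect.
        corpus = [
            TaggedDocument(list(rng.choice(tokens, size=len(tokens))), [i])
            for i, tokens in enumerate(docs)
        ]
        model = Doc2Vec(corpus, vector_size=100, epochs=20, seed=seed + r)
        # Embed each lexicon half as if it were a document.
        pos_vec = model.infer_vector(pos_lex)
        neg_vec = model.infer_vector(neg_lex)
        for i in range(len(docs)):
            doc_vec = model.dv[i]
            # Vote for the lexicon half whose embedding is closer (cosine).
            pos_sim = np.dot(doc_vec, pos_vec) / (
                np.linalg.norm(doc_vec) * np.linalg.norm(pos_vec)
            )
            neg_sim = np.dot(doc_vec, neg_vec) / (
                np.linalg.norm(doc_vec) * np.linalg.norm(neg_vec)
            )
            votes[i] += np.sign(pos_sim - neg_sim)
    # Average signed vote per document: > 0 leans positive, < 0 negative.
    return votes / n_resamples
```

A call such as `lex2sent_scores(tokenized_reviews, pos_words, neg_words)` would then yield one bagged score per document; averaging the signed votes over the resampled training runs is the bagging effect the abstract refers to, and it runs on a CPU without any external hardware.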
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There Outlier Words? [14.816706893177997]
In this paper we compute sentiment for more than 150,000 English-language texts drawn from four domains.
We model differences in sentiment scores between approaches for documents in each domain using a regression.
Our findings are that the importance of a word depends on the domain and that there are no standout lexical entries that systematically cause differences in sentiment scores.
arXiv Detail & Related papers (2023-11-10T18:21:50Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Text Detoxification using Large Pre-trained Neural Models [57.72086777177844]
We present two novel unsupervised methods for eliminating toxicity in text.
The first method combines guidance of the generation process with small style-conditional language models.
The second method uses BERT to replace toxic words with their non-offensive synonyms.
arXiv Detail & Related papers (2021-09-18T11:55:32Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on the LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
arXiv Detail & Related papers (2021-05-09T21:20:31Z)
- Disentangling Homophemes in Lip Reading using Perplexity Analysis [10.262299768603894]
This paper proposes a new application for the Generative Pre-Training transformer.
It serves as a language model to convert visual speech, in the form of visemes, into language, in the form of words and sentences.
The network uses the search for optimal perplexity to perform the viseme-to-word mapping.
arXiv Detail & Related papers (2020-11-28T12:12:17Z)
- Assessing Robustness of Text Classification through Maximal Safe Radius Computation [21.05890715709053]
We aim to provide guarantees that the model prediction does not change if a word is replaced with a plausible alternative, such as a synonym.
As a measure of robustness, we adopt the notion of the maximal safe radius for a given input text, which is the minimum distance in the embedding space to the decision boundary.
For the upper bound computation, we employ Monte Carlo Tree Search in conjunction with syntactic filtering to analyse the effect of single and multiple word substitutions.
arXiv Detail & Related papers (2020-10-01T09:46:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.