Lex2Sent: A bagging approach to unsupervised sentiment analysis
- URL: http://arxiv.org/abs/2209.13023v2
- Date: Tue, 22 Oct 2024 15:18:55 GMT
- Title: Lex2Sent: A bagging approach to unsupervised sentiment analysis
- Authors: Kai-Robin Lange, Jonas Rieger, Carsten Jentsch
- Abstract summary: In this paper, we propose an alternative approach to classifying texts: Lex2Sent.
To classify texts, we train embedding models to determine the distances between document embeddings and the embeddings of a suitable lexicon.
We show that our model outperforms lexica and provides a basis for a high-performing few-shot fine-tuning approach to binary sentiment analysis.
- Abstract: Unsupervised text classification, with its most common form being sentiment analysis, used to be performed by counting the words of a text that appear in a lexicon, which assigns each word to one class or marks it as neutral. In recent years, these lexicon-based methods fell out of favor and were replaced by computationally demanding fine-tuning techniques for encoder-only models such as BERT and by zero-shot classification using decoder-only models such as GPT-4. In this paper, we propose an alternative approach: Lex2Sent, which improves on classic lexicon methods but does not require any GPU or external hardware. To classify texts, we train embedding models to determine the distances between document embeddings and the embeddings of the parts of a suitable lexicon. We employ resampling, which results in a bagging effect that boosts classification performance. We show that our model outperforms lexica and provides a basis for a high-performing few-shot fine-tuning approach to binary sentiment analysis.
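To make the pipeline concrete, here is a minimal sketch of the Lex2Sent idea in Python. It is a hedged illustration, not the paper's reference implementation: gensim's Doc2Vec stands in for "an embedding model", and the function name `lex2sent_scores`, the variables `docs`, `pos_lex`, `neg_lex`, and the exact resampling scheme are illustrative assumptions.

```python
# Sketch of Lex2Sent (assumptions: gensim's Doc2Vec as the embedding model;
# `docs` is a list of token lists; `pos_lex`/`neg_lex` are token lists holding
# the positive/negative halves of a sentiment lexicon).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def lex2sent_scores(docs, pos_lex, neg_lex, n_resamples=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(docs))
    for r in range(n_resamples):
        # Resample each document's tokens with replacement; training one
        # embedding model per resampled corpus produces the bagging effect.
        corpus = [
            TaggedDocument(list(rng.choice(tokens, size=len(tokens))), [i])
            for i, tokens in enumerate(docs)
        ]
        model = Doc2Vec(corpus, vector_size=100, epochs=20, seed=seed + r)
        # Embed each lexicon half as if it were a document.
        pos_vec = model.infer_vector(pos_lex)
        neg_vec = model.infer_vector(neg_lex)
        for i in range(len(docs)):
            doc_vec = model.dv[i]
            # Vote for the lexicon half whose embedding is closer (cosine).
            pos_sim = np.dot(doc_vec, pos_vec) / (
                np.linalg.norm(doc_vec) * np.linalg.norm(pos_vec)
            )
            neg_sim = np.dot(doc_vec, neg_vec) / (
                np.linalg.norm(doc_vec) * np.linalg.norm(neg_vec)
            )
            votes[i] += np.sign(pos_sim - neg_sim)
    # Average signed vote per document: > 0 leans positive, < 0 negative.
    return votes / n_resamples
```

A call such as `lex2sent_scores(tokenized_reviews, pos_words, neg_words)` would then yield one bagged score per document; averaging the signed votes over the resampled training runs is the bagging effect the abstract refers to, and it runs on a CPU without any external hardware.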
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There Outlier Words? [14.816706893177997]
In this paper we compute sentiment for more than 150,000 English-language texts drawn from four domains.
We model differences in sentiment scores between approaches for documents in each domain using a regression.
Our findings are that the importance of a word depends on the domain and that there are no standout lexical entries that systematically cause differences in sentiment scores.
arXiv Detail & Related papers (2023-11-10T18:21:50Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Text Detoxification using Large Pre-trained Neural Models [57.72086777177844]
We present two novel unsupervised methods for eliminating toxicity in text.
The first method combines guidance of the generation process with small style-conditional language models.
The second method uses BERT to replace toxic words with their non-offensive synonyms.
arXiv Detail & Related papers (2021-09-18T11:55:32Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on the LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
arXiv Detail & Related papers (2021-05-09T21:20:31Z)
- Disentangling Homophemes in Lip Reading using Perplexity Analysis [10.262299768603894]
This paper proposes a new application for the Generative Pre-Training transformer.
It serves as a language model to convert visual speech, in the form of visemes, into language, in the form of words and sentences.
The network uses the search for optimal perplexity to perform the viseme-to-word mapping.
arXiv Detail & Related papers (2020-11-28T12:12:17Z)
- Assessing Robustness of Text Classification through Maximal Safe Radius Computation [21.05890715709053]
We aim to provide guarantees that the model prediction does not change if a word is replaced with a plausible alternative, such as a synonym.
As a measure of robustness, we adopt the notion of the maximal safe radius for a given input text, which is the minimum distance in the embedding space to the decision boundary.
For the upper bound computation, we employ Monte Carlo Tree Search in conjunction with syntactic filtering to analyse the effect of single and multiple word substitutions.
arXiv Detail & Related papers (2020-10-01T09:46:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.