A Case Study of Spanish Text Transformations for Twitter Sentiment
Analysis
- URL: http://arxiv.org/abs/2106.02009v1
- Date: Thu, 3 Jun 2021 17:24:31 GMT
- Title: A Case Study of Spanish Text Transformations for Twitter Sentiment
Analysis
- Authors: Eric S. Tellez, Sabino Miranda-Jim\'enez, Mario Graff, Daniela
Moctezuma, Oscar S. Siodia, and Elio A. Villase\~nor
- Abstract summary: Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness.
New forms of textual expressions present new challenges to analyze text given the use of slang, orthographic and grammatical errors.
- Score: 1.9694608733361543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis is a text mining task that determines the polarity of a
given text, i.e., its positiveness or negativeness. Recently, it has received a
lot of attention given the interest in opinion mining in micro-blogging
platforms. These new forms of textual expressions present new challenges to
analyze text given the use of slang, orthographic and grammatical errors, among
others. Along with these challenges, a practical sentiment classifier should be
able to handle efficiently large workloads.
The aim of this research is to identify which text transformations
(lemmatization, stemming, entity removal, among others), tokenizers (e.g.,
words $n$-grams), and tokens weighting schemes impact the most the accuracy of
a classifier (Support Vector Machine) trained on two Spanish corpus. The
methodology used is to exhaustively analyze all the combinations of the text
transformations and their respective parameters to find out which
characteristics the best performing classifiers have in common. Furthermore,
among the different text transformations studied, we introduce a novel approach
based on the combination of word based $n$-grams and character based $q$-grams.
The results show that this novel combination of words and characters produces a
classifier that outperforms the traditional word based combination by $11.17\%$
and $5.62\%$ on the INEGI and TASS'15 dataset, respectively.
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z) - Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking [11.022295941449919]
We develop INSPECTOR, a human-in-the-loop data inspection technique.
In a user study, INSPECTOR increases the number of texts with correct labels identified by 3X on a sentiment analysis task and by 4X on a hate speech detection task.
arXiv Detail & Related papers (2024-04-29T17:16:27Z) - Leveraging ChatGPT As Text Annotation Tool For Sentiment Analysis [6.596002578395151]
ChatGPT is a new product of OpenAI and has emerged as the most popular AI product.
This study explores the use of ChatGPT as a tool for data labeling for different sentiment analysis tasks.
arXiv Detail & Related papers (2023-06-18T12:20:42Z) - Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
computational modelling has been applied to identify complex words in texts and substitute them for simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z) - Like a Good Nearest Neighbor: Practical Content Moderation and Text
Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Context-Sensitive Visualization of Deep Learning Natural Language
Processing Models [9.694190108703229]
We propose a new NLP Transformer context-sensitive visualization method.
It finds the most significant groups of tokens (words) that have the greatest effect on the output.
The most influential word combinations are visualized in a heatmap.
arXiv Detail & Related papers (2021-05-25T20:26:38Z) - Match-Ignition: Plugging PageRank into Transformer for Long-form Text
Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z) - TextScanner: Reading Characters in Order for Robust Scene Text
Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts RNN for context modeling and performs paralleled prediction for character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.