Related papers: Sentiment Analysis of Code-Mixed Social Media Text (Hinglish)

Sentiment Analysis of Code-Mixed Social Media Text (Hinglish)

URL: http://arxiv.org/abs/2102.12149v1
Date: Wed, 24 Feb 2021 09:15:34 GMT
Title: Sentiment Analysis of Code-Mixed Social Media Text (Hinglish)
Authors: Gaurav Singh
Abstract summary: Various stages involved in performing the sentiment analysis were data consolidation, data cleaning, data transformation and modelling. The models were created using various machine learning algorithms such as SVM, KNN, Decision Trees, Random Forests, Naive Bayes, Logistic Regression, and ensemble voting classifiers.
Score: 4.081440927534578
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper discusses the results obtained for different techniques applied for performing the sentiment analysis of social media (Twitter) code-mixed text written in Hinglish. The various stages involved in performing the sentiment analysis were data consolidation, data cleaning, data transformation and modelling. Various data cleaning techniques were applied, data was cleaned in five iterations and the results of experiments conducted were noted after each iteration. Data was transformed using count vectorizer, one hot vectorizer, tf-idf vectorizer, doc2vec, word2vec and fasttext embeddings. The models were created using various machine learning algorithms such as SVM, KNN, Decision Trees, Random Forests, Naive Bayes, Logistic Regression, and ensemble voting classifiers. The data was obtained from a task on Codalab competition website which was listed as Task:9 on the Semeval-2020 competition website. The models created were evaluated using the F1-score (macro). The best F1-score of 69.07 was achieved using ensemble voting classifier.

Related papers

Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [107.24906866038431]
We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they could become useful for training.<n>We show that mixing high-quality raw texts and our rewritten texts lead to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks.
arXiv Detail & Related papers (2025-06-05T07:12:12Z)
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z)
Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases. Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding. This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. We reconcile the generalization and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images. A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)
Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields. Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo in which handwritten mathematical terms are supposed to be automatically classified. The input data set contains data of different writers, with label strings constructed from a total of 15 different possible characters.
arXiv Detail & Related papers (2021-09-12T19:33:34Z)
KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text [1.0499611180329804]
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European languages. The datasets of two Dravidian languages Viz. Malayalam and Tamil of size 4000 observations, each were shared by the HASOC organizers. The best performing classification models developed for both languages are applied on test datasets.
arXiv Detail & Related papers (2021-02-19T11:08:02Z)
Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT) We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
Decision Tree J48 at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text (Hinglish) [3.007778295477907]
This system uses Weka as a tool for providing the classifier for the classification of tweets. python is used for loading the data from the files provided and cleaning it. The system performance was assessed using the official competition evaluation metric F1-score.
arXiv Detail & Related papers (2020-08-26T06:30:43Z)
Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics. We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
Deep Learning Brasil -- NLP at SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets [0.2294014185517203]
In this paper, we describe a methodology to predict sentiment in code-mixed tweets (hindi-english) Our team called verissimo.manoel in CodaLab developed an approach based on an ensemble of four models. The final classification algorithm was an ensemble of some predictions of all softmax values from these four models.
arXiv Detail & Related papers (2020-07-28T16:42:41Z)
Voice@SRIB at SemEval-2020 Task 9 and 12: Stacked Ensembling method for Sentiment and Offensiveness detection in Social Media [2.9008108937701333]
We train embeddings, ensembling methods for Sentimix, and OffensEval tasks. We evaluate our models on macro F1-score, precision, accuracy, and recall on the datasets.
arXiv Detail & Related papers (2020-07-20T11:54:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.