Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
- URL: http://arxiv.org/abs/2006.00206v1
- Date: Sat, 30 May 2020 07:17:27 GMT
- Title: Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
- Authors: Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, John P. McCrae
- Abstract summary: We create a code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube.
In this paper, we describe the process of creating the corpus and assigning polarities.
We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
- Score: 0.9235531183915556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the sentiment of a comment from a video or an image is an
essential task in many applications. Sentiment analysis of a text can be useful
for various decision-making processes. One such application is to analyse the
popular sentiments of videos on social media based on viewer comments. However,
comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts.
Non-availability of annotated code-mixed data for a low-resourced language like
Tamil also adds difficulty to this problem. To overcome this, we created a gold
standard Tamil-English code-switched, sentiment-annotated corpus containing
15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator
agreement and show the results of sentiment analysis trained on this corpus as
a benchmark.
Related papers
- YouTube Comments Decoded: Leveraging LLMs for Low Resource Language Classification [0.0]
We introduce a novel gold standard corpus designed for sarcasm and sentiment detection within code-mixed texts.
The primary objective of this task is to identify sarcasm and sentiment polarity within a code-mixed dataset of Tamil-English and Malayalam-English comments and posts collected from social media platforms.
We experiment with state-of-the-art large language models like GPT-3.5 Turbo via prompting to classify comments into sarcastic or non-sarcastic categories.
arXiv Detail & Related papers (2024-11-06T17:58:01Z)
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Sentiment Analysis with R: Natural Language Processing for Semi-Automated Assessments of Qualitative Data [0.0]
This tutorial introduces the basic functions for performing a sentiment analysis with R and explains how text documents can be analysed step by step.
A comparison of two political speeches illustrates a possible use case.
arXiv Detail & Related papers (2022-06-25T13:25:39Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
- CMSAOne@Dravidian-CodeMix-FIRE2020: A Meta Embedding and Transformer model for Code-Mixed Sentiment Analysis on Social Media Text [9.23545668304066]
Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence.
Sentiment analysis (SA) is a fundamental step in NLP and is well studied for monolingual text.
This paper proposes a meta embedding with a transformer method for sentiment analysis on the Dravidian code-mixed dataset.
arXiv Detail & Related papers (2021-01-22T08:48:27Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- A Sentiment Analysis Dataset for Code-Mixed Malayalam-English [0.8454131372606295]
This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators.
We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
arXiv Detail & Related papers (2020-05-30T07:32:37Z)
- LiSSS: A toy corpus of Spanish Literary Sentences for Emotions detection [1.5356167668895644]
We constitute this corpus by manually classifying the sentences in a set of emotions: Love, Fear, Happiness, Anger and Sadness/Pain.
The LiSSS corpus will be available to the community as a free resource to evaluate or create CC-like algorithms.
arXiv Detail & Related papers (2020-05-17T11:14:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.