Cooking Is All About People: Comment Classification On Cookery Channels
Using BERT and Classification Models (Malayalam-English Mix-Code)
- URL: http://arxiv.org/abs/2007.04249v3
- Date: Wed, 22 Jul 2020 08:40:09 GMT
- Title: Cooking Is All About People: Comment Classification On Cookery Channels
Using BERT and Classification Models (Malayalam-English Mix-Code)
- Authors: Subramaniam Kazhuparambil (1) and Abhishek Kaushik (1 and 2) ((1)
Dublin Business School, (2) Dublin City University)
- Abstract summary: We have evaluated top-performing classification models for classifying comments that are different combinations of English and Malayalam.
Results indicate that Multinomial Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest and Decision Trees offer similar levels of accuracy in comment classification.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scope of a lucrative career promoted by Google through its video
distribution platform YouTube has attracted a large number of users to become
content creators. An important aspect of this line of work is the feedback
received in the form of comments, which shows how well the content is being
received by the audience. However, the volume of comments, coupled with spam
and the limited tools available for comment classification, makes it virtually
impossible for a creator to go through each and every comment and gather
constructive feedback. Automatic classification of comments is a challenge even
for established classification models, since comments are often of variable
length and riddled with slang, symbols and abbreviations. The challenge is
greater still where comments are multilingual, as the messages are often rife
with the respective vernacular. In this work, we have evaluated top-performing
classification models for classifying comments that are different combinations
of English and Malayalam (only English, only Malayalam, and a mix of English
and Malayalam). The statistical analysis of the results indicates that
Multinomial Naive Bayes, K-Nearest Neighbors (KNN), Support Vector Machine
(SVM), Random Forest and Decision Trees offer similar levels of accuracy in
comment classification. Further, we have also evaluated three multilingual
transformer-based language models (BERT, DistilBERT and XLM) and compared their
performance to the traditional machine learning classification techniques. XLM
was the top-performing BERT model with an accuracy of 67.31. Random Forest with
a Term Frequency Vectorizer was the best-performing traditional classification
model with an accuracy of 63.59.
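As a rough sketch of the best traditional setup described above (a Random Forest over raw term-frequency features), the following scikit-learn pipeline illustrates the idea. The toy comments, labels, and hyperparameters are invented for illustration and are not taken from the paper's dataset or configuration.

```python
# Sketch: term-frequency vectorizer feeding a Random Forest classifier,
# mirroring the paper's best-performing traditional model. All data below
# is a made-up, mixed-language example, not the paper's dataset.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

comments = [
    "Superb recipe, thank you chef!",            # English
    "adipoli taste, veedil try cheythu",         # transliterated Malayalam-English mix
    "Subscribe to my channel for free gifts",    # spam-like
    "ingredients list please",                   # query
]
labels = ["praise", "praise", "spam", "query"]

model = Pipeline([
    ("tf", CountVectorizer()),  # raw term-frequency (count) features
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(comments, labels)
print(model.predict(["thank you for the superb recipe"]))
```

With real YouTube comments, the same pipeline would be trained on labeled comment/category pairs; the vectorizer treats code-mixed text as a bag of tokens, which is why transliterated Malayalam words are handled without any language-specific preprocessing.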
Related papers
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.
We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Roles of Scaling and Instruction Tuning in Language Perception: Model vs. Human Attention [58.817405319722596]
This work compares the self-attention of several large language models (LLMs) in different sizes to assess the effect of scaling and instruction tuning on language perception.
Results show that scaling enhances the human resemblance and improves the effective attention by reducing the trivial pattern reliance, while instruction tuning does not.
We also find that current LLMs are consistently closer to non-native than native speakers in attention, suggesting sub-optimal language perception in all models.
arXiv Detail & Related papers (2023-10-29T17:16:40Z)
- cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models [0.9012198585960441]
This paper describes our multiclass classification system developed as part of the LT-EDI@RANLP-2023 shared task.
We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions.
We developed the best-performing seven-label classification system for Malayalam based on the weighted macro-averaged F1 score.
arXiv Detail & Related papers (2023-08-20T21:30:34Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, indicBERT and xlm-RoBERTa and compare with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT based models provide rich representations as compared to sentence embeddings from multi-lingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- Classifying YouTube Comments Based on Sentiment and Type of Sentence [0.0]
We address the challenge of text extraction and classification from YouTube comments using well-known statistical measures and machine learning models.
The results show that our approach, which incorporates conventional methods, performs well on the classification task, validating its potential in assisting content creators in increasing viewer engagement on their channels.
arXiv Detail & Related papers (2021-10-31T18:08:10Z)
- Transfer Learning for Mining Feature Requests and Bug Reports from Tweets and App Store Reviews [4.446419663487345]
Existing approaches fail to detect feature requests and bug reports with high recall and acceptable precision.
We train both monolingual and multilingual BERT models and compare the performance with state-of-the-art methods.
arXiv Detail & Related papers (2021-08-02T06:51:13Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using Machine Learning for Detection of Hate Speech and Offensive Code-Mixed Social Media text [1.0499611180329804]
This paper describes the system submitted by our team, KBCNMUJAL, for Task 2 of the shared task Hate Speech and Offensive Content Identification in Indo-European languages.
Datasets for two Dravidian languages, viz. Malayalam and Tamil, each of 4000 observations, were shared by the HASOC organizers.
The best-performing classification models developed for both languages are applied on the test datasets.
arXiv Detail & Related papers (2021-02-19T11:08:02Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.