Context-Based Tweet Engagement Prediction
- URL: http://arxiv.org/abs/2310.03147v1
- Date: Thu, 28 Sep 2023 08:36:57 GMT
- Title: Context-Based Tweet Engagement Prediction
- Authors: Jovan Jeromela
- Abstract summary: This thesis investigates how well context alone may be used to predict tweet engagement likelihood.
We employed the Spark engine on TU Wien's Little Big Data Cluster to create scalable data preprocessing, feature engineering, feature selection, and machine learning pipelines.
We also found that factors such as the prediction algorithm, training dataset size, training dataset sampling method, and feature selection significantly affect the results.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Twitter is currently one of the biggest social media platforms. Its users may
share, read, and engage with short posts called tweets. For the ACM Recommender
Systems Conference 2020, Twitter published a dataset of around 70 GB for
the annual RecSys Challenge, inviting participating teams to create models
that would predict engagement likelihoods for given user-tweet combinations.
The submitted models, predicting like, reply, retweet, and quote engagements,
were evaluated on two metrics: area under the precision-recall curve (PRAUC)
and relative cross-entropy (RCE).
In this diploma thesis, we used the RecSys 2020 Challenge dataset and
evaluation procedure to investigate how well context alone may be used to
predict tweet engagement likelihood. In doing so, we employed the Spark engine
on TU Wien's Little Big Data Cluster to create scalable data preprocessing,
feature engineering, feature selection, and machine learning pipelines. We
manually created just under 200 additional features to describe tweet context.
The results indicate that features describing users' prior engagement history
and the popularity of hashtags and links in the tweet were the most
informative. We also found that factors such as the prediction algorithm,
training dataset size, training dataset sampling method, and feature selection
significantly affect the results. After comparing the best results of our
context-only prediction models with content-only models and with models
developed by the Challenge winners, we found that the context-based models
underperformed in terms of the RCE score. The thesis concludes by examining
this discrepancy and proposing potential improvements to our implementation,
which is shared in a public git repository.
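The two Challenge metrics can be sketched in plain Python. This is an illustrative approximation, not the official RecSys 2020 evaluation code: the organizers' implementation may differ in details such as PR-curve interpolation and the exact naive baseline used for RCE.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy (log loss)."""
    ce = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        ce += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return ce / len(y_true)

def rce(y_true, y_pred):
    """Relative cross-entropy: improvement (in %) over a naive
    predictor that always outputs the observed positive rate."""
    ctr = sum(y_true) / len(y_true)
    naive_ce = cross_entropy(y_true, [ctr] * len(y_true))
    return (1.0 - cross_entropy(y_true, y_pred) / naive_ce) * 100.0

def prauc(y_true, y_pred):
    """Area under the precision-recall curve, swept step-wise
    over predictions sorted by descending score."""
    pairs = sorted(zip(y_pred, y_true), key=lambda x: -x[0])
    tp = fp = 0
    total_pos = sum(y_true)
    area, prev_recall = 0.0, 0.0
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

A predictor that always emits the ground-truth positive rate scores an RCE of exactly 0; better-calibrated predictions score positive, worse ones negative, which is why RCE is a stricter test of probability calibration than PRAUC, which only measures ranking quality.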
Related papers
- Generator-Guided Crowd Reaction Assessment [4.1756520114950035]
This paper presents a Crowd Reaction AssessMent task designed to estimate if a given social media post will receive more reaction than another.
We introduce the Crowd Reaction Estimation dataset (CRED), consisting of pairs of tweets from The White House with comparative measures of retweet count.
Our results reveal that a fine-tuned FLANG-RoBERTa model, utilizing a cross-encoder architecture with tweet content and responses generated by Claude, performs optimally.
arXiv Detail & Related papers (2024-03-08T13:05:44Z)
- Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
- BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline [47.61306219245444]
Twitter has become a target for bots and fake accounts, resulting in the spread of false information and manipulation.
This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development.
We develop a comprehensive bot detection model named BotArtist, based on user profile features.
arXiv Detail & Related papers (2023-05-31T09:12:35Z)
- Predicting the Geolocation of Tweets Using transformer models on Customized Data [17.55660062746406]
This research is aimed to solve the tweet/user geolocation prediction task.
The suggested approach implements neural networks for natural language processing to estimate the location.
The proposed models were fine-tuned on a Twitter dataset.
arXiv Detail & Related papers (2023-03-14T12:56:47Z) - Design and analysis of tweet-based election models for the 2021 Mexican
legislative election [55.41644538483948]
We use a dataset of 15 million election-related tweets in the six months preceding election day.
We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods.
arXiv Detail & Related papers (2023-01-02T12:40:05Z) - Identification of Twitter Bots based on an Explainable ML Framework: the
US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data.
A supervised machine learning (ML) framework is adopted, using the Extreme Gradient Boosting (XGBoost) algorithm.
Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z) - Synerise at RecSys 2021: Twitter user engagement prediction with a fast
neural model [0.745554610293091]
We present our 2nd place solution to ACM RecSys 2021 Challenge organized by Twitter.
The challenge aims to predict user engagement for a set of tweets, offering an exceptionally large data set of 1 billion data points.
Average inference time for single tweet engagement prediction is limited to 6ms on a single CPU core with 64GB memory.
arXiv Detail & Related papers (2021-09-23T13:51:09Z) - Model Bias in NLP -- Application to Hate Speech Classification [0.0]
This document sums up our results for the NLP lecture at ETH in the spring semester of 2021.
In this work, a BERT based neural network model is applied to the JIGSAW dataset.
We obtain precision values ranging from 64% to around 90% while still achieving acceptable recall values of at least the low 60s.
arXiv Detail & Related papers (2021-09-20T17:56:08Z) - Injecting Knowledge in Data-driven Vehicle Trajectory Predictors [82.91398970736391]
Vehicle trajectory prediction tasks have been commonly tackled from two perspectives: knowledge-driven or data-driven.
In this paper, we propose to learn a "Realistic Residual Block" (RRB) which effectively connects these two perspectives.
Our proposed method outputs realistic predictions by confining the residual range and taking into account its uncertainty.
arXiv Detail & Related papers (2021-03-08T16:03:09Z) - Sentiment Analysis on Social Media Content [0.0]
The aim of this paper is to present a model that can perform sentiment analysis of real data collected from Twitter.
Data in Twitter is highly unstructured which makes it difficult to analyze.
Our proposed model differs from prior work in this field because it combines supervised and unsupervised machine learning algorithms.
arXiv Detail & Related papers (2020-07-04T17:03:30Z) - Augmenting Data for Sarcasm Detection with Unlabeled Conversation
Context [55.898436183096614]
We present a novel data augmentation technique, CRA (Contextual Response Augmentation), which utilizes conversational context to generate meaningful samples for training.
Specifically, our model, trained with the proposed data augmentation technique, won the sarcasm detection task of FigLang2020, achieving the best performance on both the Reddit and Twitter datasets.
arXiv Detail & Related papers (2020-06-11T09:00:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.