DADIT: A Dataset for Demographic Classification of Italian Twitter Users
and a Comparison of Prediction Methods
- URL: http://arxiv.org/abs/2403.05700v1
- Date: Fri, 8 Mar 2024 22:18:13 GMT
- Title: DADIT: A Dataset for Demographic Classification of Italian Twitter Users
and a Comparison of Prediction Methods
- Authors: Lorenzo Lupo, Paul Bose, Mahyar Habibi, Dirk Hovy, Carlo Schwarz
- Abstract summary: We construct, validate, and release publicly the representative DADIT dataset of 30M tweets of 20k Italian Twitter users.
DADIT enables us to train and compare the performance of various state-of-the-art models for the prediction of the gender and age of social media users.
- Score: 20.590525489367955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social scientists increasingly use demographically stratified social media
data to study the attitudes, beliefs, and behavior of the general public. To
facilitate such analyses, we construct, validate, and release publicly the
representative DADIT dataset of 30M tweets of 20k Italian Twitter users, along
with their bios and profile pictures. We enrich the user data with high-quality
labels for gender, age, and location. DADIT enables us to train and compare the
performance of various state-of-the-art models for the prediction of the gender
and age of social media users. In particular, we investigate if tweets contain
valuable information for the task, since popular classifiers like M3 don't
leverage them. Our best XLM-based classifier improves upon the commonly used
competitor M3 by up to 53% F1. Especially for age prediction, classifiers
profit from including tweets as features. We also confirm these findings on a
German test set.
Related papers
- ThangDLU at #SMM4H 2024: Encoder-decoder models for classifying text data on social disorders in children and adolescents [49.00494558898933]
This paper describes our participation in Task 3 and Task 5 of the #SMM4H (Social Media Mining for Health) 2024 Workshop.
Task 3 is a multi-class classification task centered on tweets discussing the impact of outdoor environments on symptoms of social anxiety.
Task 5 involves a binary classification task focusing on tweets reporting medical disorders in children.
We applied transfer learning from pre-trained encoder-decoder models such as BART-base and T5-small to identify the labels of a set of given tweets.
arXiv Detail & Related papers (2024-04-30T17:06:20Z)
- Design and analysis of tweet-based election models for the 2021 Mexican legislative election [55.41644538483948]
We use a dataset of 15 million election-related tweets in the six months preceding election day.
We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods.
arXiv Detail & Related papers (2023-01-02T12:40:05Z)
- Retweet-BERT: Political Leaning Detection Using Language Features and Information Diffusion on Social Networks [30.143148646797265]
We introduce Retweet-BERT, a simple and scalable model to estimate the political leanings of Twitter users.
Our assumptions stem from patterns of network and linguistic homophily among people who share similar ideologies.
arXiv Detail & Related papers (2022-07-18T02:18:20Z)
- Identification of Twitter Bots based on an Explainable ML Framework: the US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data.
A supervised machine learning (ML) framework is adopted, using an Extreme Gradient Boosting (XGBoost) algorithm.
Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z)
- News consumption and social media regulations policy [70.31753171707005]
We analyze two social media platforms that enforced opposite moderation methods, Twitter and Gab, to assess the interplay between news consumption and content regulation.
Our results show that the presence of moderation pursued by Twitter produces a significant reduction of questionable content.
The lack of clear regulation on Gab results in the tendency of the user to engage with both types of content, showing a slight preference for the questionable ones which may account for a dissing/endorsement behavior.
arXiv Detail & Related papers (2021-06-07T19:26:32Z)
- Towards A Sentiment Analyzer for Low-Resource Languages [0.0]
This research aims to analyse the sentiment of users towards a particular trending topic that was actively and massively discussed at the time.
We use the hashtag #kpujangancurang, which was the trending topic during the Indonesian presidential election in 2019.
This research uses the RapidMiner tool to generate the Twitter data and compares Naive Bayes, K-Nearest Neighbor, Decision Tree, and Multi-Layer Perceptron classification methods for classifying the sentiment of that data.
arXiv Detail & Related papers (2020-11-12T13:50:00Z)
- TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis [0.0]
We introduce two TweetBERT models, which are domain-specific language representation models pre-trained on millions of tweets.
We show that the TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.
arXiv Detail & Related papers (2020-10-17T00:45:02Z)
- Sentiment Analysis on Social Media Content [0.0]
The aim of this paper is to present a model that can perform sentiment analysis of real data collected from Twitter.
Twitter data is highly unstructured, which makes it difficult to analyze.
Our proposed model differs from prior work in this field because it combines supervised and unsupervised machine learning algorithms.
arXiv Detail & Related papers (2020-07-04T17:03:30Z)
- TIMME: Twitter Ideology-detection via Multi-task Multi-relational Embedding [26.074367752142198]
We aim at solving the problem of predicting people's ideology, or political tendency.
We estimate it by using Twitter data, and formalize it as a classification problem.
arXiv Detail & Related papers (2020-06-02T00:00:39Z)
- Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline [47.434392695347924]
The RecSys 2020 Challenge was organized by ACM RecSys in partnership with Twitter using this dataset.
This paper touches on the key challenges faced by researchers and professionals striving to predict user engagements.
arXiv Detail & Related papers (2020-04-28T23:54:33Z)
- #MeToo on Campus: Studying College Sexual Assault at Scale Using Data Reported on Social Media [71.74529365205053]
We analyze the influence of the #MeToo trend on a pool of college followers.
The results show that the majority of topics embedded in those #MeToo tweets detail sexual harassment stories.
There exists a significant correlation between the prevalence of this trend and official reports on several major geographical regions.
arXiv Detail & Related papers (2020-01-16T18:05:46Z)