Decay No More: A Persistent Twitter Dataset for Learning Social Meaning
- URL: http://arxiv.org/abs/2204.04611v1
- Date: Sun, 10 Apr 2022 06:07:54 GMT
- Title: Decay No More: A Persistent Twitter Dataset for Learning Social Meaning
- Authors: Chiyu Zhang, Muhammad Abdul-Mageed, El Moatez Billah Nagoudi
- Abstract summary: We propose a new persistent English Twitter dataset for social meaning (PTSM).
PTSM consists of $17$ social meaning datasets in $10$ categories of tasks.
We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
- Score: 10.227026799075215
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the proliferation of social media, many studies resort to social media
to construct datasets for developing social meaning understanding systems. For
the popular case of Twitter, most researchers distribute tweet IDs without the
actual text contents due to the data distribution policy of the platform. One
issue is that the posts become increasingly inaccessible over time, which leads
to unfair comparisons and a temporal bias in social media research. To
alleviate this challenge of data decay, we leverage a paraphrase model to
propose a new persistent English Twitter dataset for social meaning (PTSM).
PTSM consists of $17$ social meaning datasets in $10$ categories of tasks. We
experiment with two SOTA pre-trained language models and show that our PTSM can
substitute the actual tweets with paraphrases with marginal performance loss.
Related papers
- SS-GEN: A Social Story Generation Framework with Large Language Models [87.11067593512716]
Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines.
Social Stories are traditionally crafted by psychology experts under strict constraints to address these challenges.
We propose SS-GEN, a framework to generate Social Stories in real-time with broad coverage.
arXiv Detail & Related papers (2024-06-22T00:14:48Z) - CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster
Tweet Classification [51.58605842457186]
We present a fine-grained disaster tweet classification model under the semi-supervised, few-shot learning setting.
Our model, CrisisMatch, effectively classifies tweets into fine-grained classes of interest using few labeled data and large amounts of unlabeled data.
arXiv Detail & Related papers (2023-10-23T07:01:09Z) - ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z) - TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for
Multilingual Tweet Representations at Twitter [31.698196219228024]
We present TwHIN-BERT, a multilingual language model productionized at Twitter.
Our model is trained on 7 billion tweets covering over 100 distinct languages.
We evaluate our model on various multilingual social recommendation and semantic understanding tasks.
arXiv Detail & Related papers (2022-09-15T19:01:21Z) - Predicting Hate Intensity of Twitter Conversation Threads [26.190359413890537]
We propose DRAGNET++, which aims to predict the intensity of hatred that a tweet can bring in through its reply chain in the future.
It uses the semantic and propagating structure of the tweet threads to maximize the contextual information leading up to and the fall of hate intensity at each subsequent tweet.
We show that DRAGNET++ outperforms all the state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-06-16T18:51:36Z) - Identification of Twitter Bots based on an Explainable ML Framework: the
US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data.
A supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm.
Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z) - The emojification of sentiment on social media: Collection and analysis
of a longitudinal Twitter sentiment dataset [5.528896840956628]
TM-Senti is a new large-scale, distantly supervised Twitter sentiment dataset with over 184 million tweets.
We describe and assess our methodology to put together a large-scale, emoticon- and emoji-based labelled sentiment analysis dataset.
Our analysis highlights interesting temporal changes, among others in the increasing use of emojis over emoticons.
arXiv Detail & Related papers (2021-08-31T14:54:46Z) - Named Entity Recognition for Social Media Texts with Semantic
Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z) - Storywrangler: A massive exploratorium for sociolinguistic, cultural,
socioeconomic, and political timelines using Twitter [0.9485862597874625]
In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded.
Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021.
For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles
arXiv Detail & Related papers (2020-07-25T18:09:22Z) - TIMME: Twitter Ideology-detection via Multi-task Multi-relational
Embedding [26.074367752142198]
We aim at solving the problem of predicting people's ideology, or political tendency.
We estimate it by using Twitter data, and formalize it as a classification problem.
arXiv Detail & Related papers (2020-06-02T00:00:39Z) - Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline [47.434392695347924]
RecSys 2020 Challenge organized by ACM RecSys in partnership with Twitter using this dataset.
This paper touches on the key challenges faced by researchers and professionals striving to predict user engagements.
arXiv Detail & Related papers (2020-04-28T23:54:33Z)