Storywrangler: A massive exploratorium for sociolinguistic, cultural,
socioeconomic, and political timelines using Twitter
- URL: http://arxiv.org/abs/2007.12988v5
- Date: Fri, 16 Jul 2021 18:32:29 GMT
- Title: Storywrangler: A massive exploratorium for sociolinguistic, cultural,
socioeconomic, and political timelines using Twitter
- Authors: Thayer Alshaabi, Jane L. Adams, Michael V. Arnold, Joshua R. Minot,
David R. Dewhurst, Andrew J. Reagan, Christopher M. Danforth, and Peter
Sheridan Dodds
- Abstract summary: In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded.
Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021.
For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles
- Score: 0.9485862597874625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real-time, social media data strongly imprints world events, popular
culture, and day-to-day conversations by millions of ordinary people at a scale
that is scarcely conventionalized and recorded. Vitally, and absent from many
standard corpora such as books and news archives, sharing and commenting
mechanisms are native to social media platforms, enabling us to quantify social
amplification (i.e., popularity) of trending storylines and contemporary
cultural phenomena. Here, we describe Storywrangler, a natural language
processing instrument designed to carry out an ongoing, day-scale curation of
over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to
2021. For each day, we break tweets into unigrams, bigrams, and trigrams
spanning over 100 languages. We track n-gram usage frequencies, and generate
Zipf distributions, for words, hashtags, handles, numerals, symbols, and
emojis. We make the data set available through an interactive time series
viewer, and as downloadable time series and daily distributions. Although
Storywrangler leverages Twitter data, our method of extracting and tracking
dynamic changes of n-grams can be extended to any similar social media
platform. We showcase a few examples of the many possible avenues of study we
aim to enable including how social amplification can be visualized through
'contagiograms'. We also present some example case studies that bridge n-gram
time series with disparate data sources to explore sociotechnical dynamics of
famous individuals, box office success, and social unrest.
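The n-gram counting and daily rank-frequency (Zipf) tabulation described in the abstract can be illustrated with a minimal sketch. It is not the Storywrangler pipeline: the whitespace tokenizer, the lowercasing, the ordinal handling of tied ranks, and the function names `ngrams` and `daily_zipf` are all assumptions made for illustration. The real instrument also handles hashtags, handles, emojis, and over 100 languages, none of which is modeled here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_zipf(tweets, n):
    """Tabulate n-gram usage for one day's tweets.

    Returns rows of (ngram, count, rank, relative_frequency), sorted by
    descending count. Ties are broken alphabetically and given ordinal
    ranks (an assumption, not necessarily the paper's convention).
    """
    counts = Counter()
    for text in tweets:
        # Assumption: naive lowercase + whitespace tokenization.
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))

    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (gram, count, rank, count / total)
        for rank, (gram, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical toy "day" of tweets.
    day = ["the cat sat", "the cat ran"]
    for n in (1, 2, 3):
        print(n, daily_zipf(day, n))
```

Running `daily_zipf` once per day and per n yields the kind of daily distribution from which a time series for any single n-gram can be read off by following its count or rank across days.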
Related papers
- Enhancing Fake News Detection in Social Media via Label Propagation on Cross-modal Tweet Graph [19.409935976725446]
We present a novel method for detecting fake news in social media.
Our method densifies the graph's connectivity to capture denser interaction better.
We use three publicly available fake news datasets, Twitter, PHEME, and Weibo, for evaluation.
arXiv Detail & Related papers (2024-06-14T09:55:54Z)
- TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter [31.698196219228024]
We present TwHIN-BERT, a multilingual language model productionized at Twitter.
Our model is trained on 7 billion tweets covering over 100 distinct languages.
We evaluate our model on various multilingual social recommendation and semantic understanding tasks.
arXiv Detail & Related papers (2022-09-15T19:01:21Z)
- Language statistics at different spatial, temporal, and grammatical scales [48.7576911714538]
We use data from Twitter to explore the rank diversity at different scales.
The greatest changes come from variations in the grammatical scale.
As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales.
arXiv Detail & Related papers (2022-07-02T01:38:48Z)
- Decay No More: A Persistent Twitter Dataset for Learning Social Meaning [10.227026799075215]
We propose a new persistent English Twitter dataset for social meaning (PTSM).
PTSM consists of $17$ social meaning datasets in $10$ categories of tasks.
We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
arXiv Detail & Related papers (2022-04-10T06:07:54Z)
- Extracting Feelings of People Regarding COVID-19 by Social Network Mining [0.0]
A dataset of COVID-related tweets in the English language is collected.
More than two million tweets from March 23 to June 23 of 2020 are analyzed.
arXiv Detail & Related papers (2021-10-12T16:45:33Z)
- The emojification of sentiment on social media: Collection and analysis of a longitudinal Twitter sentiment dataset [5.528896840956628]
TM-Senti is a new large-scale, distantly supervised Twitter sentiment dataset with over 184 million tweets.
We describe and assess our methodology to put together a large-scale, emoticon- and emoji-based labelled sentiment analysis dataset.
Our analysis highlights interesting temporal changes, among others in the increasing use of emojis over emoticons.
arXiv Detail & Related papers (2021-08-31T14:54:46Z)
- Attend and Select: A Segment Attention based Selection Mechanism for Microblog Hashtag Generation [69.73215951112452]
A hashtag is formed by tokens or phrases that may originate from various fragmentary segments of the original text.
We propose an end-to-end Transformer-based generation model which consists of three phases: encoding, segments-selection, and decoding.
We introduce two large-scale hashtag generation datasets, which are newly collected from Chinese Weibo and English Twitter.
arXiv Detail & Related papers (2021-06-06T15:13:58Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Streaming Social Event Detection and Evolution Discovery in Heterogeneous Information Networks [90.3475746663728]
Events are happening in real-world and real-time, which can be planned and organized for occasions, such as social gatherings, festival celebrations, influential meetings or sports activities.
Social media platforms generate a lot of real-time text information regarding public events with different topics.
However, mining social events is challenging because events typically exhibit heterogeneous texture and metadata are often ambiguous.
arXiv Detail & Related papers (2021-04-02T02:13:10Z)
- Content-based Analysis of the Cultural Differences between TikTok and Douyin [95.32409577885645]
Short-form video social media shifts away from the traditional media paradigm by telling the audience a dynamic story to attract their attention.
In particular, different combinations of everyday objects can be employed to represent a unique scene that is both interesting and understandable.
Offered by the same company, TikTok and Douyin are popular examples of such new media that has become popular in recent years.
The hypothesis that they express cultural differences together with media fashion and social idiosyncrasy is the primary target of our research.
arXiv Detail & Related papers (2020-11-03T01:47:49Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present a novel peer-to-peer Hindi conversation dataset, Vyaktitv.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, like income, cultural orientation, amongst several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.