Related papers: Smart Crawling: A New Approach toward Focus Crawling from Twitter

URL: http://arxiv.org/abs/2110.06022v1
Date: Fri, 8 Oct 2021 11:04:49 GMT
Title: Smart Crawling: A New Approach toward Focus Crawling from Twitter
Authors: Ahmad Khazaie, Nacéra Bennacer Seghouani, Francesca Bugiotti
Abstract summary: Twitter data can be accessed using a REST API. "SmartTwitter Crawling" (STiC) retrieves a set of tweets related to a target topic.
Score: 0.10312968200748115
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Twitter is a social network that offers a rich and interesting source of information that is challenging to retrieve and analyze. Twitter data can be accessed using a REST API. The available operations allow retrieving tweets on the basis of a set of keywords, but with limitations such as the number of calls per minute and the size of results. Besides, there is no control over the retrieved results, and finding tweets that are relevant to a specific topic is a big issue. Given these limitations, it is important that the query keywords unambiguously cover the topic of interest in order to both reach the relevant answers and decrease the number of API calls. In this paper, we introduce a new crawling algorithm called "SmartTwitter Crawling" (STiC) that retrieves a set of tweets related to a target topic. In this algorithm, we take an initial keyword query and enrich it using a set of additional keywords that come from different data sources. The STiC algorithm relies on a DFS search in the Twitter graph, where each reached tweet is considered if it is relevant to the query keywords according to a score that is updated throughout the whole crawling process. This score takes into account the tweet text, hashtags and the users who have posted the tweet, replied to the tweet, been mentioned in the tweet or retweeted the tweet. Given this score, STiC is able to select relevant tweets in each iteration and continue by adding the related valuable tweets. Several experiments were conducted for different kinds of queries; the results showed that the precision increases compared to a simple BFS search.
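
The score-guided depth-first traversal described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' STiC implementation: the `Tweet` class, the `relevance` formula, the `threshold` value, and the in-memory `graph` dictionary are all hypothetical stand-ins chosen for illustration, and no real Twitter API call is made. The sketch only shows the general idea of expanding a tweet's neighbors when its score against the query keywords is high enough, while updating per-user scores during the crawl.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    """Hypothetical in-memory tweet node (not a Twitter API object)."""
    tweet_id: str
    text: str
    hashtags: set = field(default_factory=set)
    user: str = ""
    # IDs of related tweets (e.g. replies or retweets); an assumption of this sketch.
    neighbors: list = field(default_factory=list)


def relevance(tweet, keywords, user_scores):
    """Illustrative score: keyword overlap in text and hashtags,
    plus a small bonus for users who already posted selected tweets."""
    tokens = set(tweet.text.lower().split()) | {h.lower() for h in tweet.hashtags}
    overlap = len(tokens & keywords) / len(keywords) if keywords else 0.0
    return overlap + 0.1 * user_scores.get(tweet.user, 0.0)


def crawl_dfs(graph, seed_ids, keywords, threshold=0.3):
    """Depth-first crawl that only expands tweets scoring at least `threshold`."""
    keywords = {k.lower() for k in keywords}
    user_scores = {}   # updated throughout the crawl
    visited = set()
    selected = []
    stack = list(reversed(seed_ids))
    while stack:
        tweet_id = stack.pop()
        if tweet_id in visited or tweet_id not in graph:
            continue
        visited.add(tweet_id)
        tweet = graph[tweet_id]
        if relevance(tweet, keywords, user_scores) < threshold:
            continue  # prune: do not follow links from an irrelevant tweet
        selected.append(tweet_id)
        user_scores[tweet.user] = user_scores.get(tweet.user, 0.0) + 1.0
        # Push neighbors in reverse so the first neighbor is explored first.
        stack.extend(reversed(tweet.neighbors))
    return selected
```

Calling `crawl_dfs(graph, ["t1"], ["covid", "vaccine"])` on a dictionary mapping tweet IDs to `Tweet` objects returns the IDs of the selected tweets in visiting order. Pruning low-scoring tweets is what keeps the number of expansions (and, in a real crawler, API calls) down, at the cost of missing relevant tweets that are only reachable through irrelevant ones.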

Related papers

Real-Time Summarization of Twitter [9.034423337410274]
We focus on the real-time push notification scenario, which requires a system that monitors the stream of sampled tweets and returns the tweets relevant to given interest profiles. We employ Dirichlet score with and with very little smoothing (baseline) to classify whether a tweet is relevant to a given interest profile. It is also desirable to remove redundant tweets from the push queue.
arXiv Detail & Related papers (2024-07-11T01:56:31Z)
LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
Top-k KNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance. There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results. We develop a novel pseudo-label generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
Hashtag-Guided Low-Resource Tweet Classification [31.810562621519804]
We propose a novel Hashtag-guided Tweet Classification model (HashTation). HashTation automatically generates meaningful hashtags for the input tweet to provide useful auxiliary signals for tweet classification. Experiments show that HashTation achieves significant improvements on seven low-resource tweet classification tasks.
arXiv Detail & Related papers (2023-02-20T18:21:02Z)
Manipulating Twitter Through Deletions [64.33261764633504]
Research into influence campaigns on Twitter has mostly relied on identifying malicious activities from tweets obtained via public APIs. Here, we provide the first exhaustive, large-scale analysis of anomalous deletion patterns involving more than a billion deletions by over 11 million accounts. We find that a small fraction of accounts delete a large number of tweets daily. First, limits on tweet volume are circumvented, allowing certain accounts to flood the network with over 26 thousand daily tweets. Second, coordinated networks of accounts engage in repetitive likes and unlikes of content that is eventually deleted, which can manipulate ranking algorithms.
arXiv Detail & Related papers (2022-03-25T20:07:08Z)
Identification of Twitter Bots based on an Explainable ML Framework: the US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data. A supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm. Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z)
A Case Study to Reveal if an Area of Interest has a Trend in Ongoing Tweets Using Word and Sentence Embeddings [0.0]
We have proposed an easily applicable automated methodology in which the Daily Mean Similarity Scores show the similarity between the daily tweet corpus and the target words. The Daily Mean Similarity Scores are mainly based on cosine similarity and word/sentence embeddings. We have also compared the effectiveness of using word versus sentence embeddings while applying our methodology and found that both give almost the same results.
arXiv Detail & Related papers (2021-10-02T18:44:55Z)
Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information. Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks. This study carries out an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies [3.5263924621989196]
We propose a new task: predicting the predominant sentiment among (first-order) replies to a given tweet. We create RETWEET, a large dataset of tweets and replies manually annotated with sentiment labels. We use the automatically labeled data for supervised training of a neural network to predict reply sentiment from the original tweets.
arXiv Detail & Related papers (2021-04-21T13:08:45Z)
Covid-Transformer: Detecting COVID-19 Trending Topics on Twitter Using Universal Sentence Encoder [7.305019142196582]
The coronavirus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe. With its global impact, COVID-19 has become a major concern of people almost everywhere. We try to analyze the tweets and detect the trending topics and major concerns of people on Twitter.
arXiv Detail & Related papers (2020-09-08T19:00:38Z)
Writer Identification Using Microblogging Texts for Social Media Forensics [53.180678723280145]
We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes. We test varying sized author sets and varying amounts of training/test texts per author.
arXiv Detail & Related papers (2020-07-31T00:23:18Z)
On Identifying Hashtags in Disaster Twitter Data [55.17975121160699]
We construct a unique dataset of disaster-related tweets annotated with hashtags useful for filtering actionable information. Using this dataset, we investigate Long Short Term Memory-based models within a Multi-Task Learning framework. The best performing model achieves an F1-score as high as 92.22%.
arXiv Detail & Related papers (2020-01-05T22:37:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.