Smart Crawling: A New Approach toward Focus Crawling from Twitter
- URL: http://arxiv.org/abs/2110.06022v1
- Date: Fri, 8 Oct 2021 11:04:49 GMT
- Title: Smart Crawling: A New Approach toward Focus Crawling from Twitter
- Authors: Ahmad Khazaie, Nacéra Bennacer Seghouani, Francesca Bugiotti
- Abstract summary: Twitter data can be accessed using a REST API.
"SmartTwitter Crawling" (STiC) retrieves a set of tweets related to a target topic.
- Score: 0.10312968200748115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Twitter is a social network that offers a rich and interesting source of
information challenging to retrieve and analyze. Twitter data can be accessed
using a REST API. The available operations allow retrieving tweets on the basis
of a set of keywords but with limitations such as the number of calls per
minute and the size of results. Besides, there is no control on retrieved
results and finding tweets which are relevant to a specific topic is a big
issue. Given these limitations, it is important that the query keywords cover
unambiguously the topic of interest in order to both reach the relevant answers
and decrease the number of API calls. In this paper, we introduce a new
crawling algorithm called "SmartTwitter Crawling" (STiC) that retrieves a set
of tweets related to a target topic. In this algorithm, we take an initial
keyword query and enrich it using a set of additional keywords that come from
different data sources. The STiC algorithm relies on a DFS search in the Twitter graph
where each reached tweet is kept if it is relevant to the query
keywords according to a score that is updated throughout the whole crawling process. This
scoring takes into account the tweet text, hashtags and the users who have
posted the tweet, replied to the tweet, been mentioned in the tweet or
retweeted the tweet. Given this score, STiC is able to select relevant tweets
in each iteration and continue by adding the related valuable tweets. Several
experiments were conducted for different kinds of queries; the results
showed that precision increases compared to a simple BFS search.
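The crawl described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the `Tweet` class, the 0.7/0.3 weights, the threshold, and the toy graph below are all hypothetical choices made only to show the idea of a DFS that keeps a tweet when its score (text, hashtags, and involved users) is high enough, and that updates user scores as the crawl proceeds.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    """Hypothetical tweet node; not the data model used in the paper."""
    id: str
    text: str
    hashtags: set = field(default_factory=set)
    # Users who posted, replied to, were mentioned in, or retweeted the tweet.
    users: set = field(default_factory=set)
    # Ids of tweets linked to this one in the (toy) Twitter graph.
    neighbors: list = field(default_factory=list)


def score(tweet, keywords, user_scores):
    """Assumed score: keyword overlap plus the best score of an involved user."""
    tokens = set(tweet.text.lower().split()) | {h.lower() for h in tweet.hashtags}
    text_part = len(tokens & keywords) / len(keywords)
    user_part = max((user_scores.get(u, 0.0) for u in tweet.users), default=0.0)
    return 0.7 * text_part + 0.3 * user_part  # weights are an assumption


def crawl(graph, seed_ids, keywords, threshold=0.2):
    """Depth-first crawl that expands only tweets scoring at least `threshold`."""
    keywords = {k.lower() for k in keywords}
    user_scores = {}            # updated throughout the crawl
    visited, relevant = set(), []
    stack = list(reversed(seed_ids))
    while stack:
        tid = stack.pop()
        if tid in visited or tid not in graph:
            continue
        visited.add(tid)
        tweet = graph[tid]
        s = score(tweet, keywords, user_scores)
        if s < threshold:
            continue            # irrelevant tweet: do not expand its neighbors
        relevant.append(tid)
        for u in tweet.users:   # users of relevant tweets gain credit
            user_scores[u] = max(user_scores.get(u, 0.0), s)
        stack.extend(reversed(tweet.neighbors))
    return relevant


graph = {
    "t1": Tweet("t1", "flood warning in paris", {"flood"}, {"alice"}, ["t2", "t3"]),
    "t2": Tweet("t2", "great pizza tonight", set(), {"bob"}, ["t4"]),
    "t3": Tweet("t3", "stay safe everyone", set(), {"alice"}, []),
    "t4": Tweet("t4", "paris flood update", set(), {"carol"}, []),
}
result = crawl(graph, ["t1"], {"flood", "paris"})
```

In this toy run, `t3` is kept only because its author already posted a relevant tweet, while `t4` is never reached because it sits behind the irrelevant `t2`; this illustrates the trade-off of pruning by score rather than any result reported in the paper.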
Related papers
- Real-Time Summarization of Twitter [9.034423337410274]
We focus on the real-time push notification scenario, which requires a system to monitor the stream of sampled tweets and return the tweets relevant to given interest profiles.
We employ Dirichlet score with and with very little smoothing (baseline) to classify whether a tweet is relevant to a given interest profile.
It is also desired to remove the redundant tweets from the pushing queue.
arXiv Detail & Related papers (2024-07-11T01:56:31Z) - Hashtag-Guided Low-Resource Tweet Classification [31.810562621519804]
We propose a novel Hashtag-guided Tweet Classification model (HashTation).
HashTation automatically generates meaningful hashtags for the input tweet to provide useful auxiliary signals for tweet classification.
Experiments show that HashTation achieves significant improvements on seven low-resource tweet classification tasks.
arXiv Detail & Related papers (2023-02-20T18:21:02Z) - Semantic Parsing for Conversational Question Answering over Knowledge
Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z) - Manipulating Twitter Through Deletions [64.33261764633504]
Research into influence campaigns on Twitter has mostly relied on identifying malicious activities from tweets obtained via public APIs.
Here, we provide the first exhaustive, large-scale analysis of anomalous deletion patterns involving more than a billion deletions by over 11 million accounts.
We find that a small fraction of accounts delete a large number of tweets daily.
First, limits on tweet volume are circumvented, allowing certain accounts to flood the network with over 26 thousand daily tweets.
Second, coordinated networks of accounts engage in repetitive likes and unlikes of content that is eventually deleted, which can manipulate ranking algorithms.
arXiv Detail & Related papers (2022-03-25T20:07:08Z) - Identification of Twitter Bots based on an Explainable ML Framework: the
US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data.
A supervised machine learning (ML) framework is adopted, using an Extreme Gradient Boosting (XGBoost) algorithm.
Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z) - A Case Study to Reveal if an Area of Interest has a Trend in Ongoing
Tweets Using Word and Sentence Embeddings [0.0]
We have proposed an easily applicable automated methodology in which the Daily Mean Similarity Scores show the similarity between the daily tweet corpus and the target words.
The Daily Mean Similarity Scores are mainly based on cosine similarity and word/sentence embeddings.
We have also compared the effectiveness of using word versus sentence embeddings while applying our methodology and realized that both give almost the same results.
arXiv Detail & Related papers (2021-10-02T18:44:55Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study conducts an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - How Will Your Tweet Be Received? Predicting the Sentiment Polarity of
Tweet Replies [3.5263924621989196]
We propose a new task: predicting the predominant sentiment among (first-order) replies to a given tweet.
We create RETWEET, a large dataset of tweets and replies manually annotated with sentiment labels.
We use the automatically labeled data for supervised training of a neural network to predict reply sentiment from the original tweets.
arXiv Detail & Related papers (2021-04-21T13:08:45Z) - Covid-Transformer: Detecting COVID-19 Trending Topics on Twitter Using
Universal Sentence Encoder [7.305019142196582]
The coronavirus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe.
With its global impact, COVID-19 has become a major concern of people almost everywhere.
We try to analyze the tweets and detect the trending topics and major concerns of people on Twitter.
arXiv Detail & Related papers (2020-09-08T19:00:38Z) - Writer Identification Using Microblogging Texts for Social Media
Forensics [53.180678723280145]
We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes.
We test varying sized author sets and varying amounts of training/test texts per author.
arXiv Detail & Related papers (2020-07-31T00:23:18Z) - On Identifying Hashtags in Disaster Twitter Data [55.17975121160699]
We construct a unique dataset of disaster-related tweets annotated with hashtags useful for filtering actionable information.
Using this dataset, we investigate Long Short Term Memory-based models within a Multi-Task Learning framework.
The best performing model achieves an F1-score as high as 92.22%.
arXiv Detail & Related papers (2020-01-05T22:37:17Z)