Writer Identification Using Microblogging Texts for Social Media
Forensics
- URL: http://arxiv.org/abs/2008.01533v2
- Date: Sat, 6 Mar 2021 02:42:18 GMT
- Title: Writer Identification Using Microblogging Texts for Social Media
Forensics
- Authors: Fernando Alonso-Fernandez, Nicole Mariah Sharon Belvisi, Kevin
Hernandez-Diaz, Naveed Muhammad, Josef Bigun
- Abstract summary: We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes.
We test varying sized author sets and varying amounts of training/test texts per author.
- Score: 53.180678723280145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Establishing authorship of online texts is fundamental to combat cybercrimes.
Unfortunately, text length is limited on some platforms, making the challenge
harder. We aim at identifying the authorship of Twitter messages limited to 140
characters. We evaluate popular stylometric features, widely used in literary
analysis, and specific Twitter features like URLs, hashtags, replies or quotes.
We use two databases with 93 and 3957 authors, respectively. We test varying
sized author sets and varying amounts of training/test texts per author.
Performance is further improved by feature combination via automatic selection.
With a large number of training Tweets (>500), a good accuracy (Rank-5>80%) is
achievable with only a few dozen test Tweets, even with several thousand
authors. With smaller sample sizes (10-20 training Tweets), the search space
can be diminished by 9-15% while keeping a high chance that the correct author
is retrieved among the candidates. In such cases, automatic attribution can
provide significant time savings to experts in suspect search. For
completeness, we report verification results. With few training/test Tweets,
the EER is above 20-25%, which is reduced to < 15% if hundreds of training
Tweets are available. We also quantify the computational complexity and time
permanence of the employed features.
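As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch, not the authors' exact feature set or classifier: it extracts simple stylometric counts plus Twitter-specific markers (URLs, hashtags, replies), averages them into per-author profiles, and ranks enrolled authors by distance, which is how a Rank-5 figure would be scored. All feature choices and the Euclidean distance are illustrative assumptions.

```python
# Hedged sketch of stylometric + Twitter-specific feature extraction and
# rank-based author identification. Features and distance are assumptions.
import re
from typing import Dict, List

import numpy as np

def tweet_features(text: str) -> np.ndarray:
    """Toy feature vector for a single tweet."""
    words = re.findall(r"\w+", text)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return np.array([
        len(text),                                    # tweet length in characters
        sum(len(w) for w in words) / n_words,         # average word length
        sum(c.isupper() for c in text) / n_chars,     # uppercase ratio
        sum(c in ".,;:!?" for c in text) / n_chars,   # punctuation ratio
        text.count("#"),                              # hashtags
        text.count("@"),                              # replies / mentions
        len(re.findall(r"https?://\S+", text)),       # URLs
    ], dtype=float)

def author_profile(tweets: List[str]) -> np.ndarray:
    """Average per-tweet features into one profile vector per author."""
    return np.mean([tweet_features(t) for t in tweets], axis=0)

def rank_authors(test_tweets: List[str], profiles: Dict[str, np.ndarray]) -> List[str]:
    """Rank enrolled authors by Euclidean distance to the test profile (closest first)."""
    query = author_profile(test_tweets)
    return sorted(profiles, key=lambda a: float(np.linalg.norm(profiles[a] - query)))

# Usage: Rank-5 accuracy asks how often the true author appears among the top 5 candidates.
profiles = {
    "author_a": author_profile(["Check this out https://example.com #news", "Great day!"]),
    "author_b": author_profile(["@someone totally agree!!", "so many meetings today..."]),
}
print(rank_authors(["Another link https://example.com #news"], profiles)[:5])
```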
Related papers
- Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A major public concern regarding the training of large language models (LLMs) is whether they abuse copyrighted online text.
Previous membership inference methods may be misled by similar examples in vast amounts of training data.
We propose an alternative insert-and-detection methodology, advocating that web users and content platforms employ unique identifiers.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
- Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Identification of Twitter Bots based on an Explainable ML Framework: the US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data.
A supervised machine learning (ML) framework is adopted, using an Extreme Gradient Boosting (XGBoost) algorithm.
Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions (a minimal sketch of this kind of pipeline follows the list below).
arXiv Detail & Related papers (2021-12-08T14:12:24Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
- Text-independent writer identification using convolutional neural network [8.526559246026162]
We propose an end-to-end deep-learning method for text-independent writer identification.
Our method achieved over 91.81% accuracy in classifying writers.
arXiv Detail & Related papers (2020-09-10T14:18:03Z)
- A Few Topical Tweets are Enough for Effective User-Level Stance Detection [8.118808561953514]
We tackle stance detection for vocal Twitter users using two approaches.
In the first approach, we improve user-level stance detection by representing tweets using contextualized embeddings.
In the second approach, we expand the tweets of a given user using their Twitter timeline tweets, and then we perform unsupervised classification of the user.
arXiv Detail & Related papers (2020-04-07T15:35:55Z)
- Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features [63.48764893706088]
This work aims at identifying authors of tweet messages, which are limited to 280 characters.
For our experiments, we use a self-captured database of 40 users, with 120 to 200 tweets per user.
Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%.
arXiv Detail & Related papers (2020-03-24T19:32:11Z)
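The explainable bot-identification entry above names two concrete components, XGBoost for classification and SHAP for per-feature attribution. Below is a minimal sketch of that kind of pipeline on synthetic stand-in data; the features, labels, and hyperparameters are illustrative assumptions rather than the paper's actual setup.

```python
# Hedged sketch of an XGBoost + SHAP bot-detection pipeline. Synthetic data stands in
# for labeled Twitter account features (0 = human, 1 = bot); real features might be
# follower/following ratio, tweet rate, or account age (assumed, not from the paper).
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled account-feature matrix.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Gradient-boosted tree classifier for the bot/human decision.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# SHAP attributes each prediction to individual features, which makes the decision explainable.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)      # shape: (n_samples, n_features)
mean_abs = np.abs(shap_values).mean(axis=0)      # global importance per feature
for idx in np.argsort(mean_abs)[::-1][:5]:
    print(f"feature_{idx}: mean |SHAP| = {mean_abs[idx]:.4f}")
```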