Forensic Authorship Analysis of Microblogging Texts Using N-Grams and
Stylometric Features
- URL: http://arxiv.org/abs/2003.11545v1
- Date: Tue, 24 Mar 2020 19:32:11 GMT
- Title: Forensic Authorship Analysis of Microblogging Texts Using N-Grams and
Stylometric Features
- Authors: Nicole Mariah Sharon Belvisi, Naveed Muhammad, Fernando
Alonso-Fernandez
- Abstract summary: This work aims at identifying authors of tweet messages, which are limited to 280 characters.
We use for our experiments a self-captured database of 40 users, with 120 to 200 tweets per user.
Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%.
- Score: 63.48764893706088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, messages and text posted on the Internet have been used in
criminal investigations. Unfortunately, the authorship of many of them remains
unknown. In some channels, the problem of establishing authorship may be even
harder, since the length of digital texts is limited to a certain number of
characters. In this work, we aim at identifying authors of tweet messages,
which are limited to 280 characters. We evaluate popular features employed
traditionally in authorship attribution which capture properties of the writing
style at different levels. We use for our experiments a self-captured database
of 40 users, with 120 to 200 tweets per user. Results using this small set are
promising, with the different features providing a classification accuracy
between 92% and 98.5%. These results are competitive in comparison to existing
studies which employ short texts such as tweets or SMS.
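
The abstract names n-grams as one of the feature families but does not spell out the exact features or classifier here. As a rough illustration only, the sketch below builds a character n-gram profile per author and assigns a new tweet to the author with the most similar profile by cosine similarity. The function names, the choice of character trigrams, and the nearest-profile rule are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def build_profiles(tweets_by_author, n=3):
    """Sum the n-gram counts of each author's tweets into one profile."""
    profiles = {}
    for author, tweets in tweets_by_author.items():
        profile = Counter()
        for tweet in tweets:
            profile.update(char_ngrams(tweet, n))
        profiles[author] = profile
    return profiles


def attribute(tweet, profiles, n=3):
    """Return the author whose profile is closest to the tweet (hypothetical rule)."""
    query = char_ngrams(tweet, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy data invented for illustration; not taken from the paper's database.
training = {
    "alice": ["omg sooo tired today!!! need coffee", "sooo happy!!! best day everrr"],
    "bob": ["Meeting moved to 3 pm. Please confirm.", "Report attached. Please review by Friday."],
}
profiles = build_profiles(training)
guess = attribute("sooo tired!!! need sleep", profiles)
```

Character n-grams are a common choice for short texts because they pick up punctuation habits and spelling quirks that word-level features miss when only a few words are available per message.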
Related papers
- Understanding writing style in social media with a supervised
contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- BERT-based Authorship Attribution on the Romanian Dataset called ROST [0.0]
We use a model to detect the authorship of texts written in the Romanian language.
The dataset used is highly unbalanced, i.e., significant differences in the number of texts per author.
Results are better than expected, sometimes exceeding 87% macro-accuracy.
arXiv Detail & Related papers (2023-01-29T17:37:29Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Writer Recognition Using Off-line Handwritten Single Block Characters [59.17685450892182]
We use personal identity numbers consisting of the six digits of the date of birth, DoB.
We evaluate two recognition approaches, one based on handcrafted features that compute directional measurements, and another based on deep features from a ResNet50 model.
Results show the presence of identity-related information in a piece of handwritten information as small as six digits with the DoB.
arXiv Detail & Related papers (2022-01-25T23:04:10Z)
- DeepStyle: User Style Embedding for Authorship Attribution of Short
Texts [57.503904346336384]
Authorship attribution (AA) is an important and widely studied research topic with many applications.
Recent works have shown that deep learning methods could achieve significant accuracy improvement for the AA task.
We propose DeepStyle, a novel embedding-based framework that learns the representations of users' salient writing styles.
arXiv Detail & Related papers (2021-03-14T15:56:37Z)
- Writer Identification Using Microblogging Texts for Social Media
Forensics [53.180678723280145]
We evaluate popular stylometric features, widely used in literary analysis, and specific Twitter features like URLs, hashtags, replies or quotes.
We test varying sized author sets and varying amounts of training/test texts per author.
arXiv Detail & Related papers (2020-07-31T00:23:18Z)
- A Few Topical Tweets are Enough for Effective User-Level Stance
Detection [8.118808561953514]
We tackle stance detection for vocal Twitter users using two approaches.
In the first approach, we improve user-level stance detection by representing tweets using contextualized embeddings.
In the second approach, we expand the tweets of a given user using their Twitter timeline tweets, and then we perform unsupervised classification of the user.
arXiv Detail & Related papers (2020-04-07T15:35:55Z)
- Investigating Classification Techniques with Feature Selection For
Intention Mining From Twitter Feed [0.0]
Micro-blogging service Twitter has more than 200 million registered users who exchange more than 65 million posts per day.
Most of the tweets are written informally and often in slang language.
This paper investigates the problem of selecting features that affect extracting user's intention from Twitter feeds.
arXiv Detail & Related papers (2020-01-22T11:55:33Z)