BERT-based Authorship Attribution on the Romanian Dataset called ROST
- URL: http://arxiv.org/abs/2301.12500v1
- Date: Sun, 29 Jan 2023 17:37:29 GMT
- Title: BERT-based Authorship Attribution on the Romanian Dataset called ROST
- Authors: Sanda-Maria Avram
- Abstract summary: We use a model to detect the authorship of texts written in the Romanian language.
The dataset used is highly unbalanced, i.e., significant differences in the number of texts per author.
Results are better than expected, sometimes exceeding 87% macro-accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although it has been around for decades, the problem of Authorship
Attribution is still very much in focus. Some of the more recent instruments used are the
pre-trained language models, the most prevalent being BERT. Here we used such a
model to detect the authorship of texts written in the Romanian language. The
dataset used is highly unbalanced, i.e., significant differences in the number
of texts per author, the sources from which the texts were collected, the time
period in which the authors lived and wrote these texts, the medium intended to
be read (i.e., paper or online), and the type of writing (i.e., stories, short
stories, fairy tales, novels, literary articles, and sketches). The results are
better than expected, sometimes exceeding 87% macro-accuracy.
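The paper's headline metric is macro-accuracy, i.e., per-author accuracy averaged with equal weight per author, which is the appropriate choice for a dataset this unbalanced. A minimal sketch of the metric (the labels below are a hypothetical toy example, not the ROST data):

```python
from collections import defaultdict

def macro_accuracy(y_true, y_pred):
    """Average of per-class recall: each author counts equally,
    regardless of how many texts they contributed."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy example: author "A" has 4 texts, "B" has 1.
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]
# Plain (micro) accuracy would be 0.8; macro-accuracy
# penalizes missing the minority author "B".
print(macro_accuracy(y_true, y_pred))  # 0.5
```

With micro accuracy, a classifier can score well by always predicting prolific authors; macro-accuracy removes that shortcut.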
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z) - Reddit is all you need: Authorship profiling for Romanian [49.1574468325115]
Authorship profiling is the process of identifying an author's characteristics based on their writings.
In this paper, we introduce a corpus of short texts in the Romanian language, annotated with certain author characteristic keywords.
arXiv Detail & Related papers (2024-10-13T16:27:31Z) - Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset [1.174020933567308]
This paper reports on a research project aimed at developing an optical character recognition (OCR) model for handwritten Syriac texts.
A dataset was created, KHAMIS, which consists of handwritten sentences in the East Syriac script.
The data was collected from volunteers capable of reading and writing in the language to create KHAMIS.
The handwritten OCR model achieved a character error rate of 1.097-1.610% on the training set and 8.963-10.490% on the evaluation set.
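The character error rate (CER) reported for KHAMIS is conventionally the Levenshtein (edit) distance between the recognized text and the reference, divided by the reference length. A minimal sketch of that computation (toy strings, not Syriac data):

```python
def levenshtein(ref, hyp):
    """Edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit operations per reference character."""
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("kitten", "sitting"))  # 3 edits / 6 chars = 0.5
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference, which is why papers typically quote it as a percentage of reference characters.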
arXiv Detail & Related papers (2024-08-24T17:17:46Z) - A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z) - HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis [14.467821652366574]
We introduce HANSEN (Human ANd ai Spoken tExt beNchmark), the largest benchmark for spoken texts.
HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets.
To evaluate and demonstrate the utility of HANSEN, we perform Authorship Attribution (AA) and Author Verification (AV) on human-spoken datasets and conduct Human vs. AI spoken text detection using state-of-the-art (SOTA) models.
arXiv Detail & Related papers (2023-10-25T16:23:17Z) - Text2Time: Transformer-based Article Time Period Prediction [0.11470070927586018]
This work investigates the problem of predicting the publication period of a text document, specifically a news article, based on its textual content.
We create our own extensive labeled dataset of over 350,000 news articles published by The New York Times over six decades.
In our approach, we use a pretrained BERT model fine-tuned for the task of text classification, specifically for time period prediction.
arXiv Detail & Related papers (2023-04-21T10:05:03Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model that learns authorship embeddings instead of semantics.
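The contrastive objective behind such authorship embeddings pulls texts by the same author together in embedding space and pushes different authors apart. A toy margin-based (triplet) loss over fixed vectors illustrates the idea; this is a generic sketch, not the actual PART training objective or model:

```python
import math

def euclid(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Contrastive objective: the same-author pair (anchor, positive)
    should be closer than the cross-author pair (anchor, negative)
    by at least `margin`; otherwise the loss is positive."""
    return max(0.0, euclid(anchor, positive) - euclid(anchor, negative) + margin)

# Toy "authorship embeddings": two texts by author A, one by author B.
a1, a2 = [0.0, 0.0], [0.1, 0.0]   # same author, close together
b1 = [3.0, 4.0]                   # different author, far away
print(triplet_loss(a1, a2, b1))   # 0.0: already separated beyond the margin
```

In training, gradients from positive-loss triplets reshape the encoder so that stylistic (rather than topical) similarity dominates the distances.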
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Attend, Memorize and Generate: Towards Faithful Table-to-Text Generation in Few Shots [58.404516361586325]
Few-shot table-to-text generation is a task of composing fluent and faithful sentences to convey table content using limited data.
This paper proposes a novel approach, Attend, Memorize and Generate (AMG), inspired by the text generation process of humans.
arXiv Detail & Related papers (2022-03-01T20:37:20Z) - How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z) - Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features [63.48764893706088]
This work aims at identifying authors of tweet messages, which are limited to 280 characters.
For our experiments, we use a self-captured database of 40 users, with 120 to 200 tweets per user.
Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%.
arXiv Detail & Related papers (2020-03-24T19:32:11Z)
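The last entry attributes tweets using character n-grams alongside stylometric features. A minimal sketch of character n-gram profiling with nearest-profile attribution; the author names and corpora below are toy placeholders, not the paper's 40-user database:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Frequency profile of character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(text, profiles):
    """Assign a text to the author with the most similar profile."""
    grams = char_ngrams(text)
    return max(profiles, key=lambda a: cosine(grams, profiles[a]))

# Toy corpora standing in for per-user tweet collections.
profiles = {
    "alice": char_ngrams("the quick brown fox jumps over the lazy dog " * 3),
    "bob": char_ngrams("colorless green ideas sleep furiously tonight " * 3),
}
print(attribute("quick foxes jump over dogs", profiles))  # "alice"
```

Character n-grams are popular for short texts like tweets because they capture sub-word habits (affixes, punctuation, spacing) that survive even when topical vocabulary changes.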
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.