PART: Pre-trained Authorship Representation Transformer
- URL: http://arxiv.org/abs/2209.15373v1
- Date: Fri, 30 Sep 2022 11:08:39 GMT
- Title: PART: Pre-trained Authorship Representation Transformer
- Authors: Javier Huertas-Tato, Alvaro Huertas-Garcia, Alejandro Martin, David
Camacho
- Abstract summary: Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
- Score: 64.78260098263489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Authors writing documents imprint identifying information within their texts:
vocabulary, register, punctuation, misspellings, or even emoji usage. Finding
these details is highly relevant for profiling authors, relating back to their
gender, occupation, age, and so on. Most importantly, recurring writing
patterns can help attribute authorship to a text. Previous works use
hand-crafted features or classification tasks to train their authorship models,
leading to poor performance on out-of-domain authors. A better approach to this
task is to learn stylometric representations, but this by itself is an open
research challenge. In this paper, we propose PART: a contrastively trained
model fit to learn authorship embeddings instead of semantics. By
comparing pairs of documents written by the same author, we are able to
determine the authorship of a text by evaluating the cosine similarity of the
evaluated documents, a zero-shot generalization to authorship identification.
To this end, a pre-trained Transformer with an LSTM head is trained with the
contrastive training method. We train our model on a diverse set of authors,
from literature, anonymous blog posters and corporate emails; a heterogeneous
set with distinct and identifiable writing styles. The model is evaluated on
these datasets, achieving a zero-shot accuracy of 72.39% and a top-5 accuracy
of 86.73% on the joint evaluation dataset when determining authorship from a
set of 250 different authors. We qualitatively assess the
representations with different data visualizations on the available datasets,
profiling features such as book types, gender, age, or occupation of the
author.
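The abstract compresses the whole method into a few sentences; the sketch below makes the moving parts concrete. It is a minimal illustration, not the authors' released code: the RoBERTa backbone, the embedding size, and the InfoNCE-style loss are assumptions, since the abstract only states that a pre-trained Transformer with an LSTM head is trained contrastively and that attribution is done by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel


class AuthorshipEncoder(nn.Module):
    """Pre-trained Transformer with an LSTM head, per the abstract's description."""

    def __init__(self, backbone: str = "roberta-base", embed_dim: int = 512):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(backbone)  # backbone choice is an assumption
        hidden = self.transformer.config.hidden_size
        self.lstm = nn.LSTM(hidden, embed_dim, batch_first=True)

    def forward(self, input_ids, attention_mask):
        tokens = self.transformer(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(tokens)
        # Final LSTM hidden state as the authorship embedding, L2-normalized
        # so that cosine similarity reduces to a dot product.
        return F.normalize(h_n[-1], dim=-1)


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """InfoNCE-style objective over a batch of same-author pairs (assumed loss form).

    Row i of emb_a and row i of emb_b embed two documents by the same author
    (the positive pair); all other rows in the batch act as negatives.
    """
    logits = emb_a @ emb_b.T / temperature
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(logits, targets)


@torch.no_grad()
def rank_candidates(query_emb, candidate_embs):
    """Zero-shot attribution: rank candidate authors by cosine similarity."""
    sims = candidate_embs @ query_emb  # unit-norm embeddings: dot product == cosine
    return sims.argsort(descending=True)
```

Because the embeddings are unit-normalized, scoring a document against a pool of 250 candidate authors, as in the joint evaluation above, is a single matrix-vector product followed by a sort.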
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs), with their deep reasoning capabilities and ability to maintain long-range textual associations, offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z)
- Capturing Style in Author and Document Representation [4.323709559692927]
We propose a new architecture that learns embeddings for both authors and documents with a stylistic constraint.
We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship Corpus, and IMDb62.
arXiv Detail & Related papers (2024-07-18T10:01:09Z)
- Self-Supervised Representation Learning for Online Handwriting Text Classification [0.8594140167290099]
We propose Part of Stroke Masking (POSM), a novel pretext task for pretraining models to extract informative representations from the online handwriting of individuals in English and Chinese.
To evaluate the quality of the extracted representations, we use both intrinsic and extrinsic evaluation methods.
The pretrained models are fine-tuned to achieve state-of-the-art results in tasks such as writer identification, gender classification, and handedness classification.
arXiv Detail & Related papers (2023-10-10T14:07:49Z)
- Can Authorship Representation Learning Capture Stylistic Features? [5.812943049068866]
We show that representations learned for a surrogate authorship prediction task are indeed sensitive to writing style.
As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time.
Our findings may open the door to downstream applications that require stylistic representations, such as style transfer.
arXiv Detail & Related papers (2023-08-22T15:10:45Z)
- Towards Writing Style Adaptation in Handwriting Recognition [0.0]
We explore models with writer-dependent parameters which take the writer's identity as an additional input.
We propose a Writer Style Block (WSB), an adaptive instance normalization layer conditioned on learned embeddings of the partitions.
We show that our approach outperforms a baseline with no WSB in a writer-dependent scenario and that it is possible to estimate embeddings for new writers.
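As a purely hypothetical illustration of the mechanism this entry describes, the block below implements adaptive instance normalization whose per-channel scale and shift are predicted from a learned writer embedding; the shapes, names, and 1D feature layout are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class WriterStyleBlock(nn.Module):
    """Adaptive instance normalization conditioned on a learned writer embedding."""

    def __init__(self, channels: int, writer_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # Predict a per-channel scale (gamma) and shift (beta) from the writer embedding.
        self.affine = nn.Linear(writer_dim, 2 * channels)

    def forward(self, feats, writer_emb):
        # feats: (batch, channels, width) feature map; writer_emb: (batch, writer_dim)
        gamma, beta = self.affine(writer_emb).chunk(2, dim=-1)
        return self.norm(feats) * gamma.unsqueeze(-1) + beta.unsqueeze(-1)
```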
arXiv Detail & Related papers (2023-02-13T12:36:17Z)
- Cloning Ideology and Style using Deep Learning [0.0]
This research focuses on generating text that follows the ideology and style of a specific author, including text on topics the author has not written about before.
A Bi-LSTM model makes predictions at the character level; during training, the corpus of a specific author is used along with a ground truth corpus.
A pre-trained model is used to identify ground-truth sentences that contradict the author's corpus, inclining the language model toward the author's ideology.
arXiv Detail & Related papers (2022-10-25T11:37:19Z)
- Unsupervised Neural Stylistic Text Generation using Transfer learning and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only 0.3% of model parameters to learn style-specific attributes for response generation.
We learn style-specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
Specifically, the input is a set of structured records and a reference text describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.