Same or Different? Diff-Vectors for Authorship Analysis
- URL: http://arxiv.org/abs/2301.09862v1
- Date: Tue, 24 Jan 2023 08:48:12 GMT
- Title: Same or Different? Diff-Vectors for Authorship Analysis
- Authors: Silvia Corbara and Alejandro Moreo and Fabrizio Sebastiani
- Abstract summary: In ``classic'' authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document.
Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd.
- Score: 78.83284164605473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the effects on authorship identification tasks of a
fundamental shift in how to conceive the vectorial representations of documents
that are given as input to a supervised learner. In ``classic'' authorship
analysis a feature vector represents a document, the value of a feature
represents (an increasing function of) the relative frequency of the feature in
the document, and the class label represents the author of the document. We
instead investigate the situation in which a feature vector represents an
unordered pair of documents, the value of a feature represents the absolute
difference in the relative frequencies (or increasing functions thereof) of the
feature in the two documents, and the class label indicates whether the two
documents are from the same author or not. This latter (learner-independent)
type of representation has been occasionally used before, but has never been
studied systematically. We argue that it is advantageous, and that in some
cases (e.g., authorship verification) it provides a much larger quantity of
information to the training process than the standard representation. The
experiments that we carry out on several publicly available datasets (among
which one that we here make available for the first time) show that feature
vectors representing pairs of documents (that we here call Diff-Vectors) bring
about systematic improvements in the effectiveness of authorship identification
tasks, and especially so when training data are scarce (as is often the case
in real-life authorship identification scenarios). Our experiments tackle
same-author verification, authorship verification, and closed-set authorship
attribution; while DVs are naturally geared for solving the 1st, we also
provide two novel methods for solving the 2nd and 3rd that use a solver for the
1st as a building block.
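A minimal sketch of the Diff-Vector representation described in the abstract is given below. This is not the authors' code: the toy corpus, the bag-of-words features from CountVectorizer, and the logistic-regression verifier are illustrative assumptions chosen for brevity, while the paper covers other feature sets and learners.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def diff_vector(x_i, x_j):
    """Diff-Vector for an unordered pair of documents: the absolute
    difference of their relative-frequency feature vectors."""
    return np.abs(x_i - x_j)

# Toy corpus (illustrative only): two documents by author A, one by author B.
docs = [
    "the quick brown fox jumps over the lazy dog",       # author A
    "the quick fox runs over the lazy brown dog again",  # author A
    "colorless green ideas sleep furiously every night", # author B
]
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
rel_freq = counts / counts.sum(axis=1, keepdims=True)  # relative frequencies

# Build training pairs: label 1 = same author, 0 = different authors.
X_pairs = np.vstack([
    diff_vector(rel_freq[0], rel_freq[1]),  # same-author pair
    diff_vector(rel_freq[0], rel_freq[2]),  # different-author pair
    diff_vector(rel_freq[1], rel_freq[2]),  # different-author pair
])
y_pairs = np.array([1, 0, 0])

# Any binary classifier trained on Diff-Vectors acts as a same-author verifier.
sav_classifier = LogisticRegression().fit(X_pairs, y_pairs)
```

Because every unordered pair of training documents yields one labelled example, the number of training instances grows roughly quadratically with the number of documents, which is the source of the larger quantity of training information mentioned above; as the abstract notes, the proposed methods for authorship verification and closed-set attribution then use such a same-author verifier as a building block.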
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a known problem to the documents research community.
With growing internet connectivity to personal life, an enormous amount of documents had been available in the public domain.
We address this challenge using self-supervision, unlike the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
- Searching for Discriminative Words in Multidimensional Continuous Feature Space [0.0]
We propose a novel method to extract discriminative keywords from documents.
We show how different discriminative metrics influence the overall results.
We conclude that word feature vectors can substantially improve the topical inference of documents' meaning.
arXiv Detail & Related papers (2022-11-26T18:05:11Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Identity Documents Authentication based on Forgery Detection of Guilloche Pattern [2.606834301724095]
An authentication model for identity documents based on forgery detection of guilloche patterns is proposed.
Experiments are conducted in order to analyze and identify the most proper parameters to achieve higher authentication performance.
arXiv Detail & Related papers (2022-06-22T11:37:10Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Single versus Multiple Annotation for Named Entity Recognition of Mutations [4.213427823201119]
We address the impact of using a single annotator vs two annotators, in order to measure whether multiple annotators are required.
Once we evaluate the performance loss when using a single annotator, we apply different methods to sample the training data for second annotation.
We use held-out double-annotated data to build two scenarios with different types of rankings: similarity-based and confidence based.
We evaluate both approaches on: (i) their ability to identify training instances that are erroneous, and (ii) Mutation NER performance for state-of-the-art NER systems.
arXiv Detail & Related papers (2021-01-19T03:54:17Z)
- Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder [104.25716317141321]
We propose an approach that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts.
Our approach provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.
arXiv Detail & Related papers (2020-06-15T02:59:52Z)