Enhancing Representation Generalization in Authorship Identification
- URL: http://arxiv.org/abs/2310.00436v1
- Date: Sat, 30 Sep 2023 17:11:00 GMT
- Title: Enhancing Representation Generalization in Authorship Identification
- Authors: Haining Wang
- Abstract summary: Authorship identification ascertains the authorship of texts whose origins remain undisclosed.
Modern authorship identification methods have proven effective in distinguishing authorial styles.
The presented work addresses the challenge of enhancing the generalization of stylistic representations in authorship identification.
- Score: 9.148691357200216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Authorship identification ascertains the authorship of texts whose origins
remain undisclosed. That authorship identification techniques work as reliably
as they do has been attributed to the fact that authorial style is properly
captured and represented. Although modern authorship identification methods
have evolved significantly over the years and have proven effective in
distinguishing authorial styles, the generalization of stylistic features
across domains has not been systematically reviewed. The presented work
addresses the challenge of enhancing the generalization of stylistic
representations in authorship identification, particularly when there are
discrepancies between training and testing samples. A comprehensive review of
empirical studies was conducted, focusing on various stylistic features and
their effectiveness in representing an author's style. Factors that influence
writing style, such as topic, genre, and register, were also explored, along
with strategies to mitigate their impact. While some stylistic features, like
character n-grams and function words, have proven to be robust and
discriminative, others, such as content words, can introduce biases and hinder
cross-domain generalization. Representations learned using deep learning
models, especially those incorporating character n-grams and syntactic
information, show promise in enhancing representation generalization. The
findings underscore the importance of selecting appropriate stylistic features
for authorship identification, especially in cross-domain scenarios. The
recognition of the strengths and weaknesses of various linguistic features
paves the way for more accurate authorship identification in diverse contexts.
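To ground the feature discussion, the following is a minimal sketch (not taken from the paper) of a closed-set authorship classifier built on the two feature families the review finds most robust: character n-grams and function words. The n-gram range, the abbreviated function-word list, and the scikit-learn pipeline are illustrative assumptions.

    # Minimal sketch: character n-gram + function-word features for
    # closed-set authorship identification (illustrative settings).
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.svm import LinearSVC

    # A tiny illustrative function-word list; real studies use hundreds of items.
    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "with",
                      "as", "for", "was", "on", "but", "not", "or", "which"]

    pipeline = Pipeline([
        ("features", FeatureUnion([
            # Character n-grams: topic-agnostic surface cues.
            ("char_ngrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 4),
                                            sublinear_tf=True)),
            # Counts of function words: largely content-independent.
            ("function_words", CountVectorizer(vocabulary=FUNCTION_WORDS)),
        ])),
        ("clf", LinearSVC()),
    ])

    # texts: list of documents; authors: parallel list of author labels.
    # pipeline.fit(texts, authors)
    # predicted_authors = pipeline.predict(unseen_texts)

Because neither feature family leans on topical vocabulary, this kind of pipeline is a common baseline when training and test texts come from different topics or genres.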
Related papers
- Distinguishing Fictional Voices: a Study of Authorship Verification
Models for Quotation Attribution [12.300285585201767]
We explore stylistic representations of characters built by encoding their quotes with off-the-shelf pretrained Authorship Verification models.
Results suggest that the combination of stylistic and topical information captured in some of these models accurately distinguishes characters from one another, but does not necessarily improve over semantic-only models when attributing quotes.
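As a concrete illustration of how such quote embeddings could be used for attribution (an assumption on our part, not necessarily the paper's exact procedure), one simple scheme is nearest-prototype matching over precomputed embeddings:

    # Hypothetical sketch: nearest-prototype quotation attribution using
    # quote embeddings from a pretrained authorship-verification encoder.
    import numpy as np

    def attribute_quotes(known_embs, known_speakers, query_embs):
        """Assign each query quote to the character whose mean (prototype)
        embedding is most cosine-similar to the quote embedding."""
        def l2_normalize(x):
            return x / np.linalg.norm(x, axis=-1, keepdims=True)

        speakers = np.array(known_speakers)
        characters = sorted(set(known_speakers))
        prototypes = np.stack([
            l2_normalize(known_embs[speakers == c]).mean(axis=0)
            for c in characters
        ])
        sims = l2_normalize(query_embs) @ l2_normalize(prototypes).T
        return [characters[i] for i in sims.argmax(axis=1)]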
arXiv Detail & Related papers (2024-01-30T12:49:40Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance
Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary semantics, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Can Authorship Representation Learning Capture Stylistic Features? [5.812943049068866]
We show that representations learned for a surrogate authorship prediction task are indeed sensitive to writing style.
As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time.
Our findings may open the door to downstream applications that require stylistic representations, such as style transfer.
arXiv Detail & Related papers (2023-08-22T15:10:45Z) - ALADIN-NST: Self-supervised disentangled representation learning of
artistic style through Neural Style Transfer [60.6863849241972]
We learn a representation of visual artistic style more strongly disentangled from the semantic content depicted in an image.
We show that strongly addressing the disentanglement of style and content leads to large gains in style-specific metrics.
arXiv Detail & Related papers (2023-04-12T10:33:18Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model designed to learn authorship embeddings instead of semantics.
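As a hedged sketch of what contrastive training of authorship embeddings can look like (the actual PART objective may differ), an InfoNCE-style loss pulls together two texts by the same author and pushes apart texts by different authors within a batch:

    # Illustrative InfoNCE-style objective for authorship embeddings;
    # an assumption about the setup, not PART's exact loss.
    import torch
    import torch.nn.functional as F

    def authorship_contrastive_loss(anchor_emb, positive_emb, temperature=0.07):
        """anchor_emb[i] and positive_emb[i] embed two texts by the same author;
        every other row in the batch serves as a negative (different author)."""
        anchor = F.normalize(anchor_emb, dim=-1)
        positive = F.normalize(positive_emb, dim=-1)
        logits = anchor @ positive.T / temperature      # (B, B) similarities
        targets = torch.arange(anchor.size(0), device=anchor.device)
        return F.cross_entropy(logits, targets)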
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - TaCo: Textual Attribute Recognition via Contrastive Learning [9.042957048594825]
TaCo is a contrastive framework for textual attribute recognition tailored toward the most common document scenes.
We design the learning paradigm from three perspectives: 1) generating attribute views, 2) extracting subtle but crucial details, and 3) exploiting valued view pairs for learning.
Experiments show that TaCo surpasses the supervised counterparts and advances the state-of-the-art remarkably on multiple attribute recognition tasks.
arXiv Detail & Related papers (2022-08-22T09:45:34Z) - Toward Understanding WordArt: Corner-Guided Transformer for Scene Text
Recognition [63.6608759501803]
We propose to recognize artistic text at three levels.
Firstly, corner points are applied to guide the extraction of local features inside characters, considering the robustness of corner structures to appearance and shape.
Secondly, we design a character contrastive loss to model the character-level feature, improving the feature representation for character classification.
Thirdly, we utilize a Transformer to learn image-level global features and to model the global relationship of the corner points.
arXiv Detail & Related papers (2022-07-31T14:11:05Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers
Reveals Distinctive yet Consistent Individual Styles [7.4037154707453965]
We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features.
A neural model achieves strong performance at authorship identification on short texts.
We quantify the relative contributions of different linguistic elements to idiolectal variation.
arXiv Detail & Related papers (2021-09-07T15:49:23Z) - Spectral Graph-based Features for Recognition of Handwritten Characters:
A Case Study on Handwritten Devanagari Numerals [0.0]
We propose an approach that exploits the robust graph representation and spectral graph embedding concept to represent handwritten characters.
To corroborate the efficacy of the proposed method, extensive experiments were carried out on the standard handwritten numeral dataset of the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata.
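For readers unfamiliar with the technique, a generic spectral graph embedding (assumed details, not the paper's exact pipeline) maps a graph's nodes to a fixed-size representation via eigenvectors of the graph Laplacian:

    # Generic spectral graph embedding sketch: embed a graph's nodes using
    # eigenvectors of the unnormalized Laplacian L = D - A.
    import numpy as np

    def spectral_embedding(adjacency, dim=2):
        """Return an (n_nodes, dim) node embedding from the eigenvectors of
        L = D - A with the smallest nonzero eigenvalues."""
        degree = np.diag(adjacency.sum(axis=1))
        laplacian = degree - adjacency
        eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # ascending order
        return eigenvectors[:, 1:dim + 1]  # skip the constant eigenvector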
arXiv Detail & Related papers (2020-07-07T08:40:08Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)