Related papers: Capturing Style in Author and Document Representation

URL: http://arxiv.org/abs/2407.13358v1
Date: Thu, 18 Jul 2024 10:01:09 GMT
Title: Capturing Style in Author and Document Representation
Authors: Enzo Terreau, Antoine Gourru, Julien Velcin
Abstract summary: We propose a new architecture that learns embeddings for both authors and documents with a stylistic constraint. We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship Corpus and IMDb62.
Score: 4.323709559692927
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A wide range of Deep Natural Language Processing (NLP) models integrates continuous and low dimensional representations of words and documents. Surprisingly, very few models study representation learning for authors. These representations can be used for many NLP tasks, such as author identification and classification, or in recommendation systems. A strong limitation of existing works is that they do not explicitly capture writing style, making them hardly applicable to literary data. We therefore propose a new architecture based on Variational Information Bottleneck (VIB) that learns embeddings for both authors and documents with a stylistic constraint. Our model fine-tunes a pre-trained document encoder. We stimulate the detection of writing style by adding predefined stylistic features, making the representation axis interpretable with respect to writing style indicators. We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship Corpus and IMDb62, for which we show that it matches or outperforms strong/recent baselines in authorship attribution while capturing much more accurately the authors' stylistic aspects.

Related papers

Personalized Image Generation from an Author Writing Style [0.29998889086656577]
Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) as input to a Large Language Model (LLM). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images.
arXiv Detail & Related papers (2025-07-04T05:53:48Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Can Authorship Representation Learning Capture Stylistic Features? [5.812943049068866]
We show that representations learned for a surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Our findings may open the door to downstream applications that require stylistic representations, such as style transfer.
arXiv Detail & Related papers (2023-08-22T15:10:45Z)
Learning Interpretable Style Embeddings via Prompting LLMs [46.74488355350601]
Style representation learning builds content-independent representations of author style in text. Current style representation learning uses neural methods to disentangle style from content to create style vectors. We use prompting to perform stylometry on a large number of texts to create a synthetic dataset and train human-interpretable style representations.
arXiv Detail & Related papers (2023-05-22T04:07:54Z)
ALADIN-NST: Self-supervised disentangled representation learning of artistic style through Neural Style Transfer [60.6863849241972]
We learn a representation of visual artistic style more strongly disentangled from the semantic content depicted in an image. We show that strongly addressing the disentanglement of style and content leads to large gains in style-specific metrics.
arXiv Detail & Related papers (2023-04-12T10:33:18Z)
Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts [7.294418916091011]
We introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data. Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow. ManuScript intends to provide a complete picture of the scholarly writing process by capturing the linearity and non-linearity of writing trajectory.
arXiv Detail & Related papers (2023-03-31T20:33:03Z)
PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues. A main challenge is that a person often writes a letter in different styles from time to time. We refer to this problem as the variance of online writing styles (Var-O-Styles).
arXiv Detail & Related papers (2021-12-06T07:21:53Z)
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision. We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information. Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks. This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
DRAG: Director-Generator Language Modelling Framework for Non-Parallel Author Stylized Rewriting [9.275464023441227]
Author stylized rewriting is the task of rewriting an input text in a particular author's style. We propose a Director-Generator framework to rewrite content in the target author's style.
arXiv Detail & Related papers (2021-01-28T06:52:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.