Isolating authorship from content with semantic embeddings and contrastive learning
- URL: http://arxiv.org/abs/2411.18472v1
- Date: Wed, 27 Nov 2024 16:08:46 GMT
- Title: Isolating authorship from content with semantic embeddings and contrastive learning
- Authors: Javier Huertas-Tato, Adrián Girón-Jiménez, Alejandro Martín, David Camacho
- Abstract summary: Authorship has entangled style and content inside.
We present a technique to use contrastive learning with additional hard negatives synthetically created using a semantic similarity model.
This disentanglement technique aims to distance the content embedding space from the style embedding space, leading to embeddings more informed by style.
- Score: 49.15148871877941
- Abstract: Authorship has entangled style and content inside. Authors frequently write about the same topics in the same style, so when different authors write about the exact same topic, the easiest way to distinguish them is by understanding the nuances of their style. Modern neural models for authorship can pick up these features using contrastive learning; however, some amount of content leakage is always present. Our aim is to reduce the inevitable impact and correlation between content and authorship. We present a technique to use contrastive learning (InfoNCE) with additional hard negatives synthetically created using a semantic similarity model. This disentanglement technique aims to distance the content embedding space from the style embedding space, leading to embeddings more informed by style. We demonstrate the performance with ablations on two different datasets and compare them on out-of-domain challenges. Improvements are clearly shown on challenging evaluations on prolific authors, with up to a 10% increase in accuracy when the settings are particularly hard. Trials on challenges also demonstrate the preservation of zero-shot capabilities of this method as fine tuning.
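The abstract's central idea, InfoNCE with one extra hard negative per anchor, can be illustrated with a minimal NumPy sketch. This is one possible reading and not the authors' implementation: the function name `info_nce_with_hard_negatives`, the `temperature` value, and the assumption that each hard negative arrives as a precomputed embedding from a separate semantic similarity model are all hypothetical. How the paper actually builds its hard negatives is not specified on this page.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce_with_hard_negatives(anchors, positives, hard_negatives,
                                 temperature=0.1):
    """InfoNCE loss over in-batch negatives plus one hard negative per anchor.

    All inputs have shape (batch, dim). Row i of `positives` is assumed to be
    a text by the same author as row i of `anchors`; row i of `hard_negatives`
    is assumed (hypothetically) to be a content-similar embedding that the
    style embedding should move away from.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    h = l2_normalize(hard_negatives)

    # (batch, batch): the diagonal holds the positive pairs,
    # off-diagonal entries act as ordinary in-batch negatives.
    in_batch = a @ p.T
    # (batch, 1): each anchor against its own hard negative.
    hard = np.sum(a * h, axis=1, keepdims=True)

    logits = np.concatenate([in_batch, hard], axis=1) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    idx = np.arange(a.shape[0])
    return float(-log_prob[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(4, 8))
    positives = anchors + 0.01 * rng.normal(size=(4, 8))
    hard_negatives = rng.normal(size=(4, 8))  # stand-in for semantic embeddings
    print(info_nce_with_hard_negatives(anchors, positives, hard_negatives))
```

In this sketch the hard negative only adds one extra column to the softmax denominator, so the loss grows when an anchor's embedding sits close to its content-similar counterpart. That matches the abstract's stated goal of distancing the style space from the content space.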
Related papers
- Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters.
We propose a contrastive learning-based STR framework by leveraging synthetic and real unlabeled data without any human cost.
Our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark).
arXiv Detail & Related papers (2024-11-23T15:24:47Z)
- StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples [48.44036251656947]
Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content.
We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings.
arXiv Detail & Related papers (2024-10-16T17:25:25Z)
- Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations [63.73044203154743]
Self-supervised representation learning often uses data augmentations to induce "style" attributes of the data.
It is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded.
We introduce a more principled approach that seeks to disentangle style features rather than discard them.
arXiv Detail & Related papers (2023-11-15T09:34:08Z)
- ALADIN-NST: Self-supervised disentangled representation learning of artistic style through Neural Style Transfer [60.6863849241972]
We learn a representation of visual artistic style more strongly disentangled from the semantic content depicted in an image.
We show that strongly addressing the disentanglement of style and content leads to large gains in style-specific metrics.
arXiv Detail & Related papers (2023-04-12T10:33:18Z)
- Whodunit? Learning to Contrast for Authorship Attribution [22.37948005237967]
Authorship attribution is the task of identifying the author of a given text.
We propose to fine-tune pre-trained language representations using a combination of contrastive learning and supervised learning.
We show that Contra-X advances the state-of-the-art on multiple human and machine authorship attribution benchmarks.
arXiv Detail & Related papers (2022-09-23T23:45:08Z)
- CLLD: Contrastive Learning with Label Distance for Text Classificatioin [0.6299766708197883]
We propose Contrastive Learning with Label Distance (CLLD) for learning contrastive classes.
CLLD ensures the flexibility within the subtle differences that lead to different label assignments.
Our experiments suggest that the learned label distance relieves the adversarial nature of interclasses.
arXiv Detail & Related papers (2021-10-25T07:07:14Z)
- Improving Disentangled Text Representation Learning with Information-Theoretic Guidance [99.68851329919858]
The discrete nature of natural language makes disentangling textual representations more challenging.
Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text.
Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)