POSNoise: An Effective Countermeasure Against Topic Biases in Authorship
Analysis
- URL: http://arxiv.org/abs/2005.06605v2
- Date: Thu, 1 Jul 2021 09:16:06 GMT
- Title: POSNoise: An Effective Countermeasure Against Topic Biases in Authorship
Analysis
- Authors: Oren Halvani and Lukas Graner
- Abstract summary: Authorship verification is a fundamental research task in digital text forensics.
We propose a preprocessing technique called POSNoise, which effectively masks topic-related content in a given text.
Our evaluation shows that POSNoise leads to better results compared to a well-known topic masking approach in 34 out of 42 cases, with an increase in accuracy of up to 10%.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Authorship verification (AV) is a fundamental research task in digital text
forensics, which addresses the problem of whether two texts were written by the
same person. In recent years, a variety of AV methods have been proposed that
focus on this problem and can be divided into two categories: The first
category refers to such methods that are based on explicitly defined features,
where one has full control over which features are considered and what they
actually represent. The second category, on the other hand, relates to such AV
methods that are based on implicitly defined features, where no control
mechanism is involved, so that any character sequence in a text can serve as a
potential feature. However, AV methods belonging to the second category bear
the risk that the topic of the texts may bias their classification predictions,
which in turn may lead to misleading conclusions regarding their results. To
tackle this problem, we propose a preprocessing technique called POSNoise,
which effectively masks topic-related content in a given text. In this way, AV
methods are forced to focus on such text units that are more related to the
writing style. Our empirical evaluation based on six AV methods (falling into
the second category) and seven corpora shows that POSNoise leads to better
results compared to a well-known topic masking approach in 34 out of 42 cases,
with an increase in accuracy of up to 10%.
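The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the text has already been tokenized and tagged with coarse part-of-speech labels by some external tagger. The names `KEEP_WORDS`, `TOPIC_TAGS` and `mask_topic_words` are hypothetical, and both the allow-list and the choice of topic-bearing tags are illustrative assumptions rather than the lists used in the paper.

```python
from typing import Iterable, List, Tuple

# Hypothetical allow-list of topic-agnostic words that are kept verbatim
# (an assumption for illustration, not the list used in the paper).
KEEP_WORDS = {
    "the", "a", "an", "of", "in", "on", "and", "but", "or", "to",
    "that", "which", "is", "was", "i", "he", "she", "it", "we", "they",
    "not", "however",
}

# Hypothetical set of coarse POS tags treated as topic-bearing (assumption).
TOPIC_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM"}


def mask_topic_words(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Replace topic-bearing tokens with a placeholder for their POS tag.

    `tagged` is a sequence of (token, pos_tag) pairs produced elsewhere.
    Tokens on the allow-list, and tokens whose tag is not considered
    topic-bearing (punctuation, determiners, ...), pass through unchanged.
    """
    masked = []
    for token, tag in tagged:
        if token.lower() in KEEP_WORDS or tag not in TOPIC_TAGS:
            masked.append(token)
        else:
            masked.append(f"<{tag}>")
    return masked


# Pre-tagged example sentence (the tags are supplied by hand here).
tagged_sentence = [
    ("However", "ADV"), (",", "PUNCT"), ("the", "DET"),
    ("striker", "NOUN"), ("scored", "VERB"), ("in", "ADP"),
    ("London", "PROPN"), (".", "PUNCT"),
]
print(" ".join(mask_topic_words(tagged_sentence)))
# prints: However , the <NOUN> <VERB> in <PROPN> .
```

In this sketch the stylistic skeleton of the sentence (function words, punctuation, word order) survives, while the words that reveal the topic are reduced to their grammatical category. A downstream AV method that works on character sequences would then see only the masked text.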
Related papers
- Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks.
We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance.
Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z)
- The Questio de aqua et terra: A Computational Authorship Verification Study [49.56191463229252]
This study investigates the authenticity of the Questio via computational authorship verification (AV).
We build a family of AV systems and assemble a corpus of 330 13th- and 14th-century Latin texts.
The application of the AV system to the Questio returns a highly confident prediction concerning its authenticity.
arXiv Detail & Related papers (2025-01-07T18:42:05Z)
- Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection [0.20482269513546458]
This study demonstrates that the modern generation of Large Language Models (LLMs) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs).
We introduce a method that controls which predictive indicators are used and which are excluded during classification.
This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup.
arXiv Detail & Related papers (2024-12-29T21:54:39Z)
- Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution [1.3654846342364308]
State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce.
We propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task.
We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification and depression-related symptoms detection.
arXiv Detail & Related papers (2024-05-09T12:03:38Z)
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream [6.446062819763263]
The aim of this study is to conduct an extensive study on the best algorithms for topic detection.
The text of Persian social network posts is used as the dataset.
The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better.
arXiv Detail & Related papers (2024-03-15T12:08:58Z)
- A Review of Adversarial Attack and Defense for Classification Methods [78.50824774203495]
This paper focuses on the generation and guarding of adversarial examples.
It is the hope of the authors that this paper will encourage more statisticians to work on this important and exciting field of generating and defending against adversarial examples.
arXiv Detail & Related papers (2021-11-18T22:13:43Z)
- VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching [75.71523183166799]
The prevailing framework for matching multimodal inputs is based on a two-stage process.
We argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages.
We propose VL-NMS, which is the first method to yield query-aware proposals at the first stage.
arXiv Detail & Related papers (2021-05-12T13:05:25Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
- Deep Bayes Factor Scoring for Authorship Verification [10.405174977499497]
We present a hierarchical fusion of two well-known approaches into a single end-to-end learning procedure.
A deep metric learning framework at the bottom aims to learn a pseudo-metric that maps a document of variable length onto a fixed-sized feature vector.
At the top, we incorporate a probabilistic layer to perform Bayes factor scoring in the learned metric space.
arXiv Detail & Related papers (2020-08-23T21:00:33Z)
- A Step Towards Interpretable Authorship Verification [0.0]
Authorship verification (AV) is a research branch in the field of digital text forensics.
Many approaches make use of features that are related to or influenced by the topic of the documents.
We propose an alternative AV approach that considers only topic-agnostic features in its classification decision.
arXiv Detail & Related papers (2020-06-22T16:44:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.