VeriDark: A Large-Scale Benchmark for Authorship Verification on the
Dark Web
- URL: http://arxiv.org/abs/2207.03477v1
- Date: Thu, 7 Jul 2022 17:57:11 GMT
- Title: VeriDark: A Large-Scale Benchmark for Authorship Verification on the
Dark Web
- Authors: Andrei Manolache, Florin Brad, Antonio Barbalau, Radu Tudor Ionescu,
Marius Popescu
- Abstract summary: We release VeriDark: a benchmark comprised of three large-scale authorship verification datasets and one authorship identification dataset.
We evaluate competitive NLP baselines on the three datasets and perform an analysis of the predictions to better understand the limitations of such approaches.
- Score: 25.00969884543201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Dark Web represents a hotbed for illicit activity, where users communicate
on different market forums in order to exchange goods and services. Law
enforcement agencies benefit from forensic tools that perform authorship
analysis, in order to identify and profile users based on their textual
content. However, authorship analysis has been traditionally studied using
corpora featuring literary texts such as fragments from novels or fan fiction,
which may not be suitable in a cybercrime context. Moreover, the few works that
employ authorship analysis tools for cybercrime prevention usually employ
ad-hoc experimental setups and datasets. To address these issues, we release
VeriDark: a benchmark comprised of three large-scale authorship verification
datasets and one authorship identification dataset obtained from user activity
from either Dark Web related Reddit communities or popular illicit Dark Web
market forums. We evaluate competitive NLP baselines on the three datasets and
perform an analysis of the predictions to better understand the limitations of
such approaches. We make the datasets and baselines publicly available at
https://github.com/bit-ml/VeriDark
Related papers
- A Public and Reproducible Assessment of the Topics API on Real Data [1.1510009152620668]
The Topics API for the web is Google's privacy-enhancing alternative to replace third-party cookies.
Results of prior work have led to an ongoing discussion about the capability of Topics to trade off both utility and privacy.
This paper shows on real data that Topics does not provide the same privacy guarantees to all users and that the information leakage worsens over time.
arXiv Detail & Related papers (2024-03-28T17:03:44Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models [53.83273575102087]
We propose an unsupervised inference-time approach to authorship obfuscation.
We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation.
Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLMs' APIs.
arXiv Detail & Related papers (2024-02-13T19:54:29Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- LG4AV: Combining Language Models and Graph Neural Networks for Author Verification [0.11421942894219898]
We present our novel approach LG4AV which combines language models and graph neural networks for authorship verification.
By directly feeding the available texts in a pre-trained transformer architecture, our model does not need any hand-crafted stylometric features.
Our model can benefit from relations between authors that are meaningful with respect to the verification process.
arXiv Detail & Related papers (2021-09-03T12:45:28Z)
- StateCensusLaws.org: A Web Application for Consuming and Annotating Legal Discourse Learning [89.77347919191774]
We create a web application to highlight the output of NLP models trained to parse and label discourse segments in law text.
We focus on state-level law that uses U.S. Census population numbers to allocate resources and organize government.
arXiv Detail & Related papers (2021-04-20T22:00:54Z)
- Birdspotter: A Tool for Analyzing and Labeling Twitter Users [12.558187319452657]
Birdspotter is a tool to analyze and label Twitter users.
Birdspotter.ml is an exploratory visualizer for the computed metrics.
We show how to train birdspotter into a fully-fledged bot detector.
arXiv Detail & Related papers (2020-12-04T02:25:07Z)
- Linked Credibility Reviews for Explainable Misinformation Detection [1.713291434132985]
We propose an architecture based on a core concept of Credibility Reviews (CRs) that can be used to build networks of distributed bots that collaborate for misinformation detection.
CRs serve as building blocks to compose graphs of (i) web content, (ii) existing credibility signals --fact-checked claims and reputation reviews of websites--, and (iii) automatically computed reviews.
We implement this architecture on top of lightweight extensions to Schema.org and services providing generic NLP tasks for semantic similarity and stance detection.
arXiv Detail & Related papers (2020-08-28T16:55:43Z)
- MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts [72.16295267480838]
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis.
MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects.
arXiv Detail & Related papers (2020-06-22T14:22:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.