Cracking Double-Blind Review: Authorship Attribution with Deep Learning
- URL: http://arxiv.org/abs/2211.07467v3
- Date: Mon, 3 Jul 2023 12:49:54 GMT
- Title: Cracking Double-Blind Review: Authorship Attribution with Deep Learning
- Authors: Leonard Bauersfeld, Angel Romero, Manasi Muglikar, and Davide Scaramuzza
- Abstract summary: We propose a transformer-based, neural-network architecture to attribute an anonymous manuscript to an author.
We leverage all research papers publicly available on arXiv, amounting to over 2 million manuscripts.
Our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly.
- Score: 43.483063713471935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Double-blind peer review is considered a pillar of academic research because
it is perceived to ensure a fair, unbiased, and fact-centered scientific
discussion. Yet, experienced researchers can often correctly guess from which
research group an anonymous submission originates, biasing the peer-review
process. In this work, we present a transformer-based, neural-network
architecture that only uses the text content and the author names in the
bibliography to attribute an anonymous manuscript to an author. To train and
evaluate our method, we created the largest authorship identification dataset
to date. It leverages all research papers publicly available on arXiv amounting
to over 2 million manuscripts. In arXiv-subsets with up to 2,000 different
authors, our method achieves an unprecedented authorship attribution accuracy,
where up to 73% of papers are attributed correctly. We present a scaling
analysis to highlight the applicability of the proposed method to even larger
datasets when sufficient compute capabilities are more widely available to the
academic community. Furthermore, we analyze the attribution accuracy in
settings where the goal is to identify all authors of an anonymous manuscript.
Thanks to our method, we are not only able to predict the author of an
anonymous work, but we also provide empirical evidence of the key aspects that
make a paper attributable. We have open-sourced the necessary tools to
reproduce our experiments.
Related papers
- Deep Author Name Disambiguation using DBLP Data [7.081604594416337]
Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries.
This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities.
arXiv Detail & Related papers (2023-03-17T15:50:00Z)
- arXivEdits: Understanding the Human Revision Process in Scientific Writing [17.63505461444103]
We provide a complete computational framework for studying text revision in scientific writing.
We first introduce arXivEdits, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple versions of revision.
It supports our data-driven analysis to unveil the common strategies practiced by researchers for revising their papers.
arXiv Detail & Related papers (2022-10-26T22:50:24Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Tag-Aware Document Representation for Research Paper Recommendation [68.8204255655161]
We propose a hybrid approach that leverages deep semantic representation of research papers based on social tags assigned by users.
The proposed model is effective in recommending research papers even when the rating data is very sparse.
arXiv Detail & Related papers (2022-09-08T09:13:07Z)
- Whois? Deep Author Name Disambiguation using Bibliographic Data [7.081604594416337]
Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries.
This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities.
arXiv Detail & Related papers (2022-07-11T11:03:39Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Bib2Auth: Deep Learning Approach for Author Disambiguation using Bibliographic Data [4.817368273632451]
We propose a novel approach to link author names to their real-world entities by relying on their co-authorship pattern and area of research.
Our supervised deep learning model identifies an author by capturing his/her relationship with his/her co-authors and area of research.
Bib2Auth has shown good performance on a relatively large dataset.
arXiv Detail & Related papers (2021-07-09T12:25:11Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- Automatic generation of reviews of scientific papers [1.1999555634662633]
We present a method for the automatic generation of a review paper corresponding to a user-defined query.
The first part identifies key papers in the area by their bibliometric parameters, such as a graph of co-citations.
The second stage uses a BERT based architecture that we train on existing reviews for extractive summarization of these key papers.
arXiv Detail & Related papers (2020-10-08T17:47:07Z)