Anonymization of Documents for Law Enforcement with Machine Learning
- URL: http://arxiv.org/abs/2501.07334v1
- Date: Mon, 13 Jan 2025 13:47:00 GMT
- Title: Anonymization of Documents for Law Enforcement with Machine Learning
- Authors: Manuel Eberhardinger, Patrick Takenaka, Daniel Grießhaber, Johannes Maucher,
- Abstract summary: We present a system for automatically anonymizing images of scanned documents.
Our method considers the viability of further forensic processing after anonymization.
We show that our approach outperforms both a purely automatic redaction system and also a naive copy-paste scheme of the reference anonymization.
- Score: 1.237454174824584
- Abstract: The steadily increasing use of data-driven methods in areas that handle sensitive personal information, such as law enforcement, demands an ever-increasing effort from these institutions to comply with data protection guidelines. In this work, we present a system for automatically anonymizing images of scanned documents, reducing manual effort while ensuring data protection compliance. Our method preserves the viability of further forensic processing after anonymization: it minimizes automatically redacted areas by combining automatic detection of sensitive regions with knowledge from a manually anonymized reference document. Using a self-supervised image model for instance retrieval of the reference document, our approach requires only one anonymized example to efficiently redact all documents of the same type, significantly reducing processing time. We show that our approach outperforms both a purely automatic redaction system and a naive copy-paste scheme that transfers the reference anonymization to other documents, evaluated on a hand-crafted dataset of ground truth redactions.
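The two ideas named in the abstract, retrieving the matching reference document and restricting automatic detections with its manual redactions, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how detections and the reference are combined, so the overlap rule, the `min_overlap` threshold, the toy two-dimensional embeddings, and all function names below are assumptions chosen for illustration only.

```python
from math import sqrt

# A box is (x0, y0, x1, y1) in pixel coordinates (hypothetical convention).

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_reference(query_emb, references):
    """Return the name of the reference whose embedding is most similar.

    `references` maps a name to (embedding, manually redacted boxes).
    In practice the embeddings would come from an image model; here
    they are plain lists of floats.
    """
    return max(references, key=lambda name: cosine(query_emb, references[name][0]))

def overlap_fraction(box, other):
    """Fraction of `box` area that is covered by `other`."""
    ix = max(0.0, min(box[2], other[2]) - max(box[0], other[0]))
    iy = max(0.0, min(box[3], other[3]) - max(box[1], other[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return ix * iy / area if area > 0 else 0.0

def select_redactions(detected, reference_boxes, min_overlap=0.5):
    """Keep only detections that sufficiently overlap a reference redaction.

    Assumed combination rule: detections outside the manually redacted
    regions stay visible, which keeps the redacted area small.
    """
    return [
        d for d in detected
        if any(overlap_fraction(d, r) >= min_overlap for r in reference_boxes)
    ]

# Toy data: two reference document types and one incoming scan.
references = {
    "form_a": ([1.0, 0.0], [(10, 10, 100, 30)]),
    "form_b": ([0.0, 1.0], [(50, 200, 150, 220)]),
}
query_emb = [0.9, 0.1]
detected = [(12, 11, 90, 29), (200, 300, 260, 320)]

best = retrieve_reference(query_emb, references)
kept = select_redactions(detected, references[best][1])
print(best, kept)
```

In this toy run the scan is matched to `form_a`, and only the detection that lies inside that form's manually redacted region is kept; the second detection is left unredacted.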
Related papers
- Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs [67.0310240737424]
We introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs.
Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset.
During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs.
arXiv Detail & Related papers (2025-02-15T04:56:45Z)
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- RedactBuster: Entity Type Recognition from Redacted Documents [13.172863061928899]
We propose RedactBuster, the first deanonymization model using sentence context to perform Named Entity Recognition on redacted text.
We test RedactBuster against the most effective redaction technique and evaluate it using the publicly available Text Anonymization Benchmark (TAB).
Our results show accuracy values up to 0.985 regardless of the document nature or entity type.
arXiv Detail & Related papers (2024-04-19T16:42:44Z)
- DECDM: Document Enhancement using Cycle-Consistent Diffusion Models [3.3813766129849845]
We propose DECDM, an end-to-end document-level image translation method inspired by recent advances in diffusion models.
Our method overcomes the limitations of paired training by independently training the source (noisy input) and target (clean output) models.
We also introduce simple data augmentation strategies to improve character-glyph conservation during translation.
arXiv Detail & Related papers (2023-11-16T07:16:02Z)
- Automatic Anonymization of Swiss Federal Supreme Court Rulings [2.1963472367016426]
We enhance the existing anonymization software using a large dataset annotated with entities to be anonymized.
Our results show that using in-domain data to pre-train the models further improves the F1-score by more than 5% compared to existing models.
arXiv Detail & Related papers (2023-10-07T00:56:49Z)
- DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification.
We first mask random patches of the background-excluded document images and then reconstruct the missing pixels.
With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
arXiv Detail & Related papers (2023-04-20T14:27:15Z)
- A False Sense of Privacy: Towards a Reliable Evaluation Methodology for the Anonymization of Biometric Data [8.799600976940678]
Biometric data contains distinctive human traits such as facial features or gait patterns.
Privacy protection is extensively afforded by the technique of anonymization.
We assess the state-of-the-art methods used to evaluate the performance of anonymization.
arXiv Detail & Related papers (2023-04-04T08:46:14Z)
- Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank.
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
- No Intruder, no Validity: Evaluation Criteria for Privacy-Preserving Text Anonymization [0.48733623015338234]
We argue that researchers and practitioners developing automated text anonymization systems should carefully assess whether their evaluation methods truly reflect the system's ability to protect individuals from being re-identified.
We propose TILD, a set of evaluation criteria that comprises an anonymization method's technical performance, the information loss resulting from its anonymization, and the human ability to de-anonymize redacted documents.
arXiv Detail & Related papers (2021-03-16T18:18:29Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)