Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition
- URL: http://arxiv.org/abs/2504.08616v1
- Date: Fri, 11 Apr 2025 15:21:12 GMT
- Title: Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition
- Authors: Lei Kang, Xuanshuo Fu, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas,
- Abstract summary: Handwritten Text Recognition (HTR) is essential for document analysis and digitization.<n>Legislation like the right to be forgotten'' underscores the necessity for methods that can expunge sensitive information from trained models.<n>We introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling.
- Score: 12.228611784356412
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the ``right to be forgotten'' underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.
Related papers
- Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead [4.519836503888727]
We propose a training framework designed to improve feature representation and recognition accuracy without increasing inference costs.<n>ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase.<n> Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-04T13:44:54Z) - LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data [69.5099112089508]
Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data.<n>This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets.<n>We find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved.
arXiv Detail & Related papers (2025-10-10T05:10:49Z) - Differential Privacy in Machine Learning: From Symbolic AI to LLMs [49.1574468325115]
Differential privacy provides a formal framework to mitigate privacy risks.<n>It ensures that the inclusion or exclusion of any single data point does not significantly alter the output of an algorithm.
arXiv Detail & Related papers (2025-06-13T11:30:35Z) - MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning [6.274266343486906]
Traditional handwritten text recognition methods lack writer-specific personalization at test time.<n>We propose an efficient framework that formulates personalization as prompt tuning.<n>We validate our approach on the RIMES and IAM Handwriting Database benchmarks.
arXiv Detail & Related papers (2025-05-26T20:26:16Z) - Privacy-Preserving Biometric Verification with Handwritten Random Digit String [49.77172854374479]
Handwriting verification has stood as a steadfast identity authentication method for decades.<n>However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures.<n>We propose using the Random Digit String (RDS) for privacy-preserving handwriting verification.
arXiv Detail & Related papers (2025-03-17T03:47:25Z) - Technical Report for the Forgotten-by-Design Project: Targeted Obfuscation for Machine Learning [0.03749861135832072]
This paper explores the concept of the Right to be Forgotten (RTBF) within AI systems, contrasting it with traditional data erasure methods.<n>We introduce Forgotten by Design, a proactive approach to privacy preservation that integrates instance-specific obfuscation techniques.<n>Our experiments on the CIFAR-10 dataset demonstrate that our techniques reduce privacy risks by at least an order of magnitude while maintaining model accuracy.
arXiv Detail & Related papers (2025-01-20T15:07:59Z) - RESTOR: Knowledge Recovery through Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can memorize undesirable datapoints.<n>Many machine unlearning algorithms have been proposed that aim to erase' these datapoints.<n>We propose the RESTOR framework for machine unlearning, which evaluates the ability of unlearning algorithms to perform targeted data erasure.
arXiv Detail & Related papers (2024-10-31T20:54:35Z) - MUSE: Machine Unlearning Six-Way Evaluation for Language Models [109.76505405962783]
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content.
We propose MUSE, a comprehensive machine unlearning evaluation benchmark.
We benchmark how effectively eight popular unlearning algorithms can unlearn Harry Potter books and news articles.
arXiv Detail & Related papers (2024-07-08T23:47:29Z) - IDT: Dual-Task Adversarial Attacks for Privacy Protection [8.312362092693377]
Methods to protect privacy can involve using representations inside models that are not to detect sensitive attributes.
We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change.
We evaluate different datasets for NLP suitable for different tasks.
arXiv Detail & Related papers (2024-06-28T04:14:35Z) - Federated Face Forgery Detection Learning with Personalized Representation [63.90408023506508]
Deep generator technology can produce high-quality fake videos that are indistinguishable, posing a serious social threat.
Traditional forgery detection methods directly centralized training on data.
The paper proposes a novel federated face forgery detection learning with personalized representation.
arXiv Detail & Related papers (2024-06-17T02:20:30Z) - Classification of Non-native Handwritten Characters Using Convolutional Neural Network [0.0]
The classification of English characters written by non-native users is performed by proposing a custom-tailored CNN model.
We train this CNN with a new dataset called the handwritten isolated English character dataset.
The proposed model with five convolutional layers and one hidden layer outperforms state-of-the-art models in terms of character recognition accuracy.
arXiv Detail & Related papers (2024-06-06T21:08:07Z) - Machine Unlearning for Document Classification [14.71726430657162]
A novel approach, known as machine unlearning, has emerged to make AI models forget about a particular class of data.
This work represents a pioneering step towards the development of machine unlearning methods aimed at addressing privacy concerns in document analysis applications.
arXiv Detail & Related papers (2024-04-29T18:16:13Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - Offline Detection of Misspelled Handwritten Words by Convolving
Recognition Model Features with Text Labels [0.0]
We introduce the task of comparing a handwriting image to text.
Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network.
Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
arXiv Detail & Related papers (2023-09-18T21:13:42Z) - Independent Distribution Regularization for Private Graph Embedding [55.24441467292359]
Graph embeddings are susceptible to attribute inference attacks, which allow attackers to infer private node attributes from the learned graph embeddings.
To address these concerns, privacy-preserving graph embedding methods have emerged.
We propose a novel approach called Private Variational Graph AutoEncoders (PVGAE) with the aid of independent distribution penalty as a regularization term.
arXiv Detail & Related papers (2023-08-16T13:32:43Z) - CSSL-RHA: Contrastive Self-Supervised Learning for Robust Handwriting
Authentication [23.565017967901618]
We propose a novel Contrastive Self-Supervised Learning framework for Robust Handwriting Authentication.
It can dynamically learn complex yet important features and accurately predict writer identities.
Our proposed model can still effectively achieve authentication even under abnormal circumstances, such as data falsification and corruption.
arXiv Detail & Related papers (2023-07-18T02:20:46Z) - Uncovering the Handwritten Text in the Margins: End-to-end Handwritten
Text Detection and Recognition [0.840835093659811]
This work presents an end-to-end framework for automatic detection and recognition of handwritten marginalia.
It uses data augmentation and transfer learning to overcome training data scarcity.
The effectiveness of the proposed framework has been empirically evaluated on the data from early book collections found in the Uppsala University Library in Sweden.
arXiv Detail & Related papers (2023-03-10T14:00:53Z) - Pre-trained Encoders in Self-Supervised Learning Improve Secure and
Privacy-preserving Supervised Learning [63.45532264721498]
Self-supervised learning is an emerging technique to pre-train encoders using unlabeled data.
We perform first systematic, principled measurement study to understand whether and when a pretrained encoder can address the limitations of secure or privacy-preserving supervised learning algorithms.
arXiv Detail & Related papers (2022-12-06T21:35:35Z) - Unintended Memorization and Timing Attacks in Named Entity Recognition
Models [5.404816271595691]
We study the setting when NER models are available as a black-box service for identifying sensitive information in user documents.
With updated pre-trained NER models from spaCy, we demonstrate two distinct membership attacks on these models.
arXiv Detail & Related papers (2022-11-04T03:32:16Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - SmartPatch: Improving Handwritten Word Imitation with Patch
Discriminators [67.54204685189255]
We propose SmartPatch, a new technique increasing the performance of current state-of-the-art methods.
We combine the well-known patch loss with information gathered from the parallel trained handwritten text recognition system.
This leads to a more enhanced local discriminator and results in more realistic and higher-quality generated handwritten words.
arXiv Detail & Related papers (2021-05-21T18:34:21Z) - Differentially Private and Fair Deep Learning: A Lagrangian Dual
Approach [54.32266555843765]
This paper studies a model that protects the privacy of the individuals sensitive information while also allowing it to learn non-discriminatory predictors.
The method relies on the notion of differential privacy and the use of Lagrangian duality to design neural networks that can accommodate fairness constraints.
arXiv Detail & Related papers (2020-09-26T10:50:33Z) - Separating Content from Style Using Adversarial Learning for Recognizing
Text in the Wild [103.51604161298512]
We propose an adversarial learning framework for the generation and recognition of multiple characters in an image.
Our framework can be integrated into recent recognition methods to achieve new state-of-the-art recognition accuracy.
arXiv Detail & Related papers (2020-01-13T12:41:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.