Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation
- URL: http://arxiv.org/abs/2511.18272v1
- Date: Sun, 23 Nov 2025 03:45:22 GMT
- Authors: Richard J. Young
- Abstract summary: Vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction, successfully suppressing long-form spatially-distributed identifiers (patient names, dates of birth, physical addresses at 100% effectiveness) while failing to prevent short structured identifiers (medical record numbers, social security numbers, email addresses, account numbers at 0% effectiveness). Ablation studies varying mask expansion radius (r=1,2,3) demonstrate that increased spatial coverage does not improve reduction beyond this ceiling, indicating that language model contextual inference - not insufficient visual masking - drives structured identifier leakage. A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
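The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is not the author's code; the function name `hybrid_reduction` is hypothetical, and it assumes the hybrid figure is obtained by applying the stated 80% NLP accuracy to the PHI that vision masking leaves behind. It also notes that three fully suppressed categories out of the seven named in the abstract gives 3/7, which matches 42.9%, although the abstract does not state that the aggregate is computed with equal weight per category.

```python
def hybrid_reduction(vision_reduction: float, nlp_accuracy: float) -> float:
    """Total PHI reduction when NLP post-processing removes a share of
    the identifiers that vision masking failed to suppress.

    Assumption (not stated in the abstract): the two stages compose as
    vision + nlp_accuracy * (1 - vision).
    """
    remaining = 1.0 - vision_reduction
    return vision_reduction + nlp_accuracy * remaining


# Figures taken from the abstract.
vision_only = 0.429   # ceiling reached by all masking strategies
nlp_accuracy = 0.80   # assumed NLP accuracy on the remaining identifiers

total = hybrid_reduction(vision_only, nlp_accuracy)
print(f"hybrid reduction: {total:.1%}")

# Consistency check under an equal-weight-per-category assumption:
# 3 categories at 100% (names, dates of birth, addresses) and
# 4 categories at 0% (MRNs, SSNs, email addresses, account numbers).
suppressed, leaked = 3, 4
category_ratio = suppressed / (suppressed + leaked)
print(f"3 of 7 categories: {category_ratio:.1%}")
```

Under these assumptions the composition gives 0.429 + 0.8 × 0.571 = 0.8858, which rounds to the 88.6% reported for the simulated hybrid architecture.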
Related papers
- SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy [53.75084833636302]
We propose SIDeR, a Semantic decoupling-driven framework for unrestricted face privacy protection. SIDeR decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. For authorized access, SIDeR can be restored to its original form when the correct password is provided.
arXiv Detail & Related papers (2026-02-04T19:30:48Z)
- Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI [5.6285415648839425]
Collaborative machine learning promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification.
arXiv Detail & Related papers (2025-11-26T02:27:40Z)
- Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis [2.6554246520306624]
Mask What Matters is a controllable text-guided masking framework for self-supervised medical image analysis. It consistently outperforms existing MIM methods, achieving gains of up to +3.1 percentage points in classification accuracy. It achieves these improvements with substantially lower overall masking ratios.
arXiv Detail & Related papers (2025-09-27T02:26:56Z)
- Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models [7.916129615051081]
We introduce a dataset comprising over 34,000 synthetic images generated by diffusion models. The dataset includes 214 human-annotated images that serve as a gold-standard reference for validation.
arXiv Detail & Related papers (2025-06-25T07:06:29Z)
- VSF-Med: A Vulnerability Scoring Framework for Medical Vision-Language Models [6.390468088226493]
We introduce VSF-Med, an end-to-end vulnerability-scoring framework for medical Vision Language Models (VLMs). VSF-Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. We show that Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of $1.29\sigma$ for persistence-of-attack-effects, while GPT-4o shows increases of $0.69\sigma$ for that same vector and $0.28\sigma$ for prompt-injection attacks.
arXiv Detail & Related papers (2025-06-25T02:56:38Z)
- Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z)
- Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait [70.00430652562012]
FarSight is an end-to-end system for person recognition that integrates biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion.
arXiv Detail & Related papers (2025-05-07T17:58:25Z)
- Unsupervised learning of Data-driven Facial Expression Coding System (DFECS) using keypoint tracking [3.0605062268685868]
We propose an unsupervised learning of an automated facial coding system by leveraging computer-vision-based facial keypoint tracking.
Results show that DFECS AUs estimated from the DISFA dataset can account for an average variance of up to 91.29 percent in test datasets.
87.5 percent of DFECS AUs are interpretable, i.e., align with the direction of facial muscle movements.
arXiv Detail & Related papers (2024-06-08T10:45:38Z)
- OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition [66.91236298878383]
Depression Recognition (DR) poses a considerable challenge, especially in the context of privacy concerns.
We design a new imaging system to erase the identity information of captured facial images while retaining disease-relevant features.
Identity information cannot be recovered, while the disease-related characteristics necessary for accurate DR are preserved.
arXiv Detail & Related papers (2024-02-29T01:20:29Z)
- Privacy-Preserving Medical Image Classification through Deep Learning and Matrix Decomposition [0.0]
Deep learning (DL) solutions have been extensively researched in the medical domain in recent years.
Because the usage of health-related data is strictly regulated, processing medical records outside the hospital environment demands robust data protection measures.
In this paper, we use singular value decomposition (SVD) and principal component analysis (PCA) to obfuscate the medical images before employing them in the DL analysis.
The capability of DL algorithms to extract relevant information from secured data is assessed on a task of angiographic view classification based on obfuscated frames.
arXiv Detail & Related papers (2023-08-31T08:21:09Z)
- Is Vertical Logistic Regression Privacy-Preserving? A Comprehensive Privacy Analysis and Beyond [57.10914865054868]
We consider vertical logistic regression (VLR) trained with mini-batch gradient descent.
We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks.
arXiv Detail & Related papers (2022-07-19T05:47:30Z)
- Dual Spoof Disentanglement Generation for Face Anti-spoofing with Depth Uncertainty Learning [54.15303628138665]
Face anti-spoofing (FAS) plays a vital role in preventing face recognition systems from presentation attacks.
Existing face anti-spoofing datasets lack diversity due to insufficient identities and insignificant variance.
We propose a Dual Spoof Disentanglement Generation framework to tackle this challenge by "anti-spoofing via generation".
arXiv Detail & Related papers (2021-12-01T15:36:59Z)
- Deep Spatial Gradient and Temporal Depth Learning for Face Anti-spoofing [61.82466976737915]
Depth supervised learning has been proven as one of the most effective methods for face anti-spoofing.
We propose a new approach to detect presentation attacks from multiple frames based on two insights.
The proposed approach achieves state-of-the-art results on five benchmark datasets.
arXiv Detail & Related papers (2020-03-18T06:11:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.