A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments
- URL: http://arxiv.org/abs/2603.04595v1
- Date: Wed, 04 Mar 2026 20:46:26 GMT
- Title: A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments
- Authors: Mohammed Omer Shakeel Ahmed,
- Abstract summary: I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information.<n>This approach offers a privacy-compliant solution to entity resolution, supports secure digital infrastructure, and enhances the reliability of public health analytics.<n>It is well-suited for integration into national health data modernization efforts, aligning with broader goals of privacy-first innovation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Duplicate records pose significant challenges in customer relationship management (CRM)and healthcare, often leading to inaccuracies in analytics, impaired user experiences, and compliance risks. Traditional deduplication methods rely heavily on direct identifiers such as names, emails, or Social Security Numbers (SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA, where such personally identifiable information (PII) is restricted or masked. In this research, I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This system leverages three distinct modalities: semantic embeddings derived from textual fields (names, cities) using pre-trained DistilBERT models, behavioral patterns extracted from user login timestamps, and device metadata encoded through categorical embeddings. These heterogeneous modalities are combined using a late fusion approach and clustered via DBSCAN, an unsupervised density-based algorithm. This proposed model is evaluated against a traditional string-matching baseline on a synthetic CRM dataset specifically designed to reflect privacy-preserving constraints. The multimodal framework demonstrated good performance, achieving a good F1-score by effectively identifying duplicates despite variations and noise inherent in the data. This approach offers a privacy-compliant solution to entity resolution and supports secure digital infrastructure, enhances the reliability of public health analytics, and promotes ethical AI adoption across government and enterprise settings. It is well-suited for integration into national health data modernization efforts, aligning with broader goals of privacy-first innovation.
Related papers
- SynBench: A Benchmark for Differentially Private Text Generation [35.908455649647784]
Data-driven decision support in high-stakes domains like healthcare and finance faces significant barriers to data sharing.<n>Recent generative AI models, such as large language models, have shown impressive performance in open-domain tasks.<n>But their adoption in sensitive environments remains limited by unpredictable behaviors and insufficient privacy-preserving datasets.
arXiv Detail & Related papers (2025-09-18T03:57:50Z) - On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis.<n>We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z) - RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting [17.294176570269]
We propose a reinforcement learning framework that fine-tunes a large language model (LLM) using a composite reward function.<n>The privacy reward combines semantic cues with structural patterns derived from a minimum spanning tree (MST) over latent representations.<n> Empirical results show that the proposed method significantly enhances author obfuscation and privacy metrics without degrading semantic quality.
arXiv Detail & Related papers (2025-08-25T04:38:19Z) - Differentially Private Relational Learning with Entity-level Privacy Guarantees [17.567309430451616]
This work presents a principled framework for relational learning with formal entity-level DP guarantees.<n>We introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency.<n>These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees.
arXiv Detail & Related papers (2025-06-10T02:03:43Z) - RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification [10.378433440829712]
We propose a fully automated framework, RedactOR for de-identifying structured and unstructured electronic health records.<n>Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule and LLM based approaches.<n>We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities.
arXiv Detail & Related papers (2025-05-23T21:13:18Z) - Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation [60.81109086640437]
We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG)<n>FedE4RAG facilitates collaborative training of client-side RAG retrieval models.<n>We apply homomorphic encryption within federated learning to safeguard model parameters.
arXiv Detail & Related papers (2025-04-27T04:26:02Z) - PriRoAgg: Achieving Robust Model Aggregation with Minimum Privacy Leakage for Federated Learning [49.916365792036636]
Federated learning (FL) has recently gained significant momentum due to its potential to leverage large-scale distributed user data.<n>The transmitted model updates can potentially leak sensitive user information, and the lack of central control of the local training process leaves the global model susceptible to malicious manipulations on model updates.<n>We develop a general framework PriRoAgg, utilizing Lagrange coded computing and distributed zero-knowledge proof, to execute a wide range of robust aggregation algorithms while satisfying aggregated privacy.
arXiv Detail & Related papers (2024-07-12T03:18:08Z) - Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z) - Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control [3.8811062755861956]
$epsilon$-PrivateSMOTE is a technique for safeguarding against re-identification and linkage attacks.
Our proposal combines synthetic data generation via noise-induced adversarial with differential privacy principles to obfuscate high-risk cases.
arXiv Detail & Related papers (2022-12-01T13:20:37Z) - Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites.
We design the first federated policy optimization algorithm for offline RL with sample complexity.
We give a theoretical guarantee for the proposed algorithm, where the suboptimality for the learned policies is comparable to the rate as if data is not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z) - Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.