Related papers: Scalable Discovery and Continuous Inventory of Personal Data at Rest in Cloud Native Systems

URL: http://arxiv.org/abs/2209.10412v1
Date: Fri, 9 Sep 2022 10:45:34 GMT
Title: Scalable Discovery and Continuous Inventory of Personal Data at Rest in Cloud Native Systems
Authors: Elias Grünewald and Leonard Schurbert
Abstract summary: Cloud native systems are processing large amounts of personal data through numerous and possibly multi-paradigmatic data stores. From a privacy engineering perspective, a core challenge is to keep track of all exact locations, where personal data is being stored. We present Teiresias, comprising i) a workflow pattern for scalable discovery of personal data at rest, and ii) a cloud native system architecture and open source prototype implementation of said workflow pattern.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cloud native systems are processing large amounts of personal data through numerous and possibly multi-paradigmatic data stores (e.g., relational and non-relational databases). From a privacy engineering perspective, a core challenge is to keep track of all exact locations, where personal data is being stored, as required by regulatory frameworks such as the European General Data Protection Regulation. In this paper, we present Teiresias, comprising i) a workflow pattern for scalable discovery of personal data at rest, and ii) a cloud native system architecture and open source prototype implementation of said workflow pattern. To this end, we enable a continuous inventory of personal data featuring transparency and accountability following DevOps/DevPrivOps practices. In particular, we scope version-controlled Infrastructure as Code definitions, cloud-based storages, and how to integrate the process into CI/CD pipelines. Thereafter, we provide iii) a comparative performance evaluation demonstrating both appropriate execution times for real-world settings, and a promising personal data detection accuracy outperforming existing proprietary tools in public clouds.

Related papers

Governing Cloud Data Pipelines with Agentic AI [0.0]
Agentic Cloud Data Engineering is a policy-aware control architecture that integrates bounded AI agents into the governance and control plane of cloud data pipelines. We show that the Agentic Cloud Data Engineering platform reduces mean pipeline recovery time by up to 45%, lowers operational cost by approximately 25%, and decreases manual intervention events by over 70% compared to static orchestration.
arXiv Detail & Related papers (2025-12-24T19:30:32Z)
CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems [0.5499796332553707]
We introduce CPSLint, a Domain-Specific Language designed to provide data preparation for industrial CPS. Main features include type checking and enforcing constraints through validation and remediation for data columns. More advanced features cover inference of extra CPS-specific data structures, both column-wise and row-wise.
arXiv Detail & Related papers (2025-10-21T13:59:56Z)
CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering [68.91862701376155]
CoSteer is a novel collaborative framework that enables decoding-time personalization through localized delta steering. We formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors.
arXiv Detail & Related papers (2025-07-07T08:32:29Z)
PyTupli: A Scalable Infrastructure for Collaborative Offline Reinforcement Learning Projects [5.744272697629195]
Offline reinforcement learning (RL) has gained traction as a powerful paradigm for learning control policies from pre-collected data. PyTupli is a Python-based tool to streamline the creation, storage, and dissemination of benchmark environments.
arXiv Detail & Related papers (2025-05-22T14:59:20Z)
Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data [10.1687640711587]
This work introduces the notion of "surrogate" public data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata. We automate the process of generating surrogate public data with large language models (LLMs). In particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records.
arXiv Detail & Related papers (2025-04-19T17:55:10Z)
Enhancing Pavement Sensor Data Acquisition for AI-Driven Transportation Research [1.22995445255292]
This paper presents comprehensive guidelines for managing transportation sensor data. It covers both archived static data and real-time data streams. The proposals were applied to INDOT's real-world case studies involving the I-65 and I-69 Greenfield districts.
arXiv Detail & Related papers (2025-02-20T03:37:46Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
Object as a Service: Simplifying Cloud-Native Development through Serverless Object Abstraction [1.7416288134936873]
We propose a new paradigm, known as Object as a Service (OaaS), that encapsulates application data and functions into the cloud object abstraction. OaaS relieves developers from resource and data management burden while offering built-in optimization features. We develop a platform named Oparaca that offers state abstraction for structured and unstructured data with consistency and fault-tolerant guarantees.
arXiv Detail & Related papers (2024-08-09T06:55:00Z)
PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
In response to increasing privacy concerns, we propose a Federated Anomaly Detection framework named PeFAD. We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
arXiv Detail & Related papers (2024-06-04T13:51:08Z)
ST-DPGAN: A Privacy-preserving Framework for Spatiotemporal Data Generation [19.18074489351738]
We propose a Graph-based model for generating privacy-protected data. Experiments conducted on three real-world spatiotemporal datasets validate the efficacy of our model. The prediction model trained on our generated data maintains a competitive edge compared to the model trained on the original data.
arXiv Detail & Related papers (2024-06-04T04:43:54Z)
Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way. We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content. We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z)
Hawk: DevOps-driven Transparency and Accountability in Cloud Native Systems [0.0]
Transparency is one of the most important principles of modern privacy regulations. Data controllers must provide data subjects with precise information about the collection, processing, storage, and transfer of personal data.
arXiv Detail & Related papers (2023-06-04T22:09:42Z)
Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings [57.45332961252628]
Privacy-preserving machine learning in data-sharing processes is an ever-critical task. This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data.
arXiv Detail & Related papers (2022-11-10T17:36:58Z)
Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge. Existing private generative models are struggling with the utility of synthetic samples. We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
Outsourcing Training without Uploading Data via Efficient Collaborative Open-Source Sampling [49.87637449243698]
Traditional outsourcing requires uploading device data to the cloud server. We propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources. We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
arXiv Detail & Related papers (2022-10-23T00:12:18Z)
Reasoning over Public and Private Data in Retrieval-Based Systems [29.515915401413334]
State-of-the-art systems explicitly retrieve relevant information to a user question from a background corpus before producing an answer. While today's retrieval systems assume the corpus is fully accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We first define the PUBLIC-PRIVATE AUTOREGRESSIVE Information RETRIEVAL (PAIR) privacy framework for the novel retrieval setting over multiple privacy scopes.
arXiv Detail & Related papers (2022-03-14T13:08:51Z)
On-Device Learning with Cloud-Coordinated Data Augmentation for Extreme Model Personalization in Recommender Systems [39.41506296601779]
We propose a new device-cloud collaborative learning framework, called CoDA, to break the dilemmas of purely cloud-based learning and on-device learning. CoDA retrieves similar samples from the cloud's global pool to augment each user's local dataset to train the recommendation model. Online A/B testing results show the remarkable performance improvement of CoDA over both cloud-based learning without model personalization and on-device training without data augmentation.
arXiv Detail & Related papers (2022-01-24T04:59:04Z)
Unsupervised Model Personalization while Preserving Privacy and Scalability: An Open Problem [55.21502268698577]
This work investigates the task of unsupervised model personalization, adapted to continually evolving, unlabeled local user images. We provide a novel Dual User-Adaptation framework (DUA) to explore the problem. This framework flexibly disentangles user-adaptation into model personalization on the server and local data regularization on the user device.
arXiv Detail & Related papers (2020-03-30T09:35:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.