CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems
- URL: http://arxiv.org/abs/2510.18651v1
- Date: Tue, 21 Oct 2025 13:59:56 GMT
- Title: CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems
- Authors: Uraz Odyurt, Ömer Sayilir, Mariëlle Stoelinga, Vadim Zaytsev,
- Abstract summary: We introduce CPSLint, a Domain-Specific Language designed to provide data preparation for industrial CPS.<n>Main features include type checking and enforcing constraints through validation and remediation for data columns.<n>More advanced features cover inference of extra CPS-specific data structures, both column-wise and row-wise.
- Score: 0.5499796332553707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Raw datasets are often too large and unstructured to work with directly, and require a data preparation process. The domain of industrial Cyber-Physical Systems (CPS) is no exception, as raw data typically consists of large amounts of time-series data logging the system's status in regular time intervals. Such data has to be sanity checked and preprocessed to be consumable by data-centric workflows. We introduce CPSLint, a Domain-Specific Language designed to provide data preparation for industrial CPS. We build up on the fact that many raw data collections in the CPS domain require similar actions to render them suitable for Machine-Learning (ML) solutions, e.g., Fault Detection and Identification (FDI) workflows, yet still vary enough to hope for one universally applicable solution. CPSLint's main features include type checking and enforcing constraints through validation and remediation for data columns, such as imputing missing data from surrounding rows. More advanced features cover inference of extra CPS-specific data structures, both column-wise and row-wise. For instance, as row-wise structures, descriptive execution phases are an effective method of data compartmentalisation are extracted and prepared for ML-assisted FDI workflows. We demonstrate CPSLint's features through a proof of concept implementation.
Related papers
- Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents [85.02904078131682]
We introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets.<n> ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic.<n>All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
arXiv Detail & Related papers (2025-10-28T17:53:13Z) - Agentic AI framework for End-to-End Medical Data Inference [5.871161259593687]
We introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference.<n>We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging.
arXiv Detail & Related papers (2025-07-24T05:56:25Z) - Efficient Conformance Checking of Rich Data-Aware Declare Specifications (Extended) [49.46686813437884]
We show that it is possible to compute data-aware optimal alignments in a rich setting with general data types and data conditions.<n>This is achieved by carefully combining the two best-known approaches to deal with control flow and data dependencies.
arXiv Detail & Related papers (2025-06-30T10:16:21Z) - PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
We propose a.
Federated Anomaly Detection framework named PeFAD with the increasing privacy concerns.
We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
arXiv Detail & Related papers (2024-06-04T13:51:08Z) - Retrieval-Augmented Mining of Temporal Logic Specifications from Data [0.46040036610482665]
This work addresses the task of learning STL requirements from observed behaviors in a data-driven manner.
We present a novel framework that combines Bayesian Optimization (BO) and Information Retrieval (IR) techniques to simultaneously learn both the structure and the parameters of STL formulae.
arXiv Detail & Related papers (2024-05-23T09:29:00Z) - Integration of Domain Expert-Centric Ontology Design into the CRISP-DM for Cyber-Physical Production Systems [45.05372822216111]
Methods from Machine Learning (ML) and Data Mining (DM) have proven to be promising in extracting complex and hidden patterns from the data collected.
However, such data-driven projects, usually performed with the Cross-Industry Standard Process for Data Mining (CRISPDM), often fail due to the disproportionate amount of time needed for understanding and preparing the data.
This contribution intends present an integrated approach so that data scientists are able to more quickly and reliably gain insights into the CPPS challenges.
arXiv Detail & Related papers (2023-07-21T15:04:00Z) - Scalable Discovery and Continuous Inventory of Personal Data at Rest in
Cloud Native Systems [0.0]
Cloud native systems are processing large amounts of personal data through numerous and possibly multi-paradigmatic data stores.
From a privacy engineering perspective, a core challenge is to keep track of all exact locations, where personal data is being stored.
We present Teiresias, comprising i) a workflow pattern for scalable discovery of personal data at rest, and ii) a cloud native system architecture and open source prototype implementation of said workflow pattern.
arXiv Detail & Related papers (2022-09-09T10:45:34Z) - Robust Event Classification Using Imperfect Real-world PMU Data [58.26737360525643]
We study robust event classification using imperfect real-world phasor measurement unit (PMU) data.
We develop a novel machine learning framework for training robust event classifiers.
arXiv Detail & Related papers (2021-10-19T17:41:43Z) - Learning to Limit Data Collection via Scaling Laws: Data Minimization
Compliance in Practice [62.44110411199835]
We build on literature in machine learning law to propose framework for limiting collection based on data interpretation that ties data to system performance.
We formalize a data minimization criterion based on performance curve derivatives and provide an effective and interpretable piecewise power law technique.
arXiv Detail & Related papers (2021-07-16T19:59:01Z) - ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for
Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN)
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.