Semantic Modelling of Organizational Knowledge as a Basis for Enterprise
Data Governance 4.0 -- Application to a Unified Clinical Data Model
- URL: http://arxiv.org/abs/2311.02082v3
- Date: Thu, 23 Nov 2023 21:30:39 GMT
- Title: Semantic Modelling of Organizational Knowledge as a Basis for Enterprise
Data Governance 4.0 -- Application to a Unified Clinical Data Model
- Authors: Miguel AP Oliveira, Stephane Manara, Bruno Molé, Thomas Muller,
Aurélien Guillouche, Lysann Hesske, Bruce Jordan, Gilles Hubert, Chinmay
Kulkarni, Pralipta Jagdev and Cedric R. Berger
- Abstract summary: We establish a simple, cost-efficient framework that enables metadata-driven, agile and (semi-)automated data governance.
We explain how we implement and use this framework to integrate 25 years of clinical study data at an enterprise scale in a fully productive environment.
- Score: 6.302916372143144
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Individuals and organizations cope with an always-growing amount of data,
which is heterogeneous in its contents and formats. An adequate data management
process yielding data quality and control over its lifecycle is a prerequisite
to getting value out of this data and minimizing inherent risks related to
multiple usages. Common data governance frameworks rely on people, policies,
and processes that fall short of the overwhelming complexity of data. Yet,
harnessing this complexity is necessary to achieve high-quality standards. The
latter will condition any downstream data usage outcome, including generative
artificial intelligence trained on this data. In this paper, we report our
concrete experience establishing a simple, cost-efficient framework that
enables metadata-driven, agile and (semi-)automated data governance (i.e. Data
Governance 4.0). We explain how we implement and use this framework to
integrate 25 years of clinical study data at an enterprise scale in a fully
productive environment. The framework encompasses both methodologies and
technologies leveraging semantic web principles. We built a knowledge graph
describing avatars of data assets in their business context, including
governance principles. Multiple ontologies articulated by an enterprise upper
ontology enable key governance actions such as FAIRification, lifecycle
management, definition of roles and responsibilities, lineage across
transformations and provenance from source systems. This metadata model is the
keystone to data governance 4.0: a semi-automatised data management process
that considers the business context in an agile manner to adapt governance
constraints to each use case and dynamically tune it based on business changes.
Related papers
- Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution. We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. We validate the effectiveness of the proposed framework through empirical studies.
arXiv Detail & Related papers (2026-02-09T18:47:51Z) - EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents [12.7922877987936]
EntWorld is a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains. We propose a schema-grounded task generation framework that directly reverse-engineers business logic from underlying database schemas. We show that state-of-the-art models achieve 47.61% success rate on EntWorld, substantially lower than the human performance.
arXiv Detail & Related papers (2026-01-25T06:58:15Z) - DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows [22.16698382751559]
Large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into code. Existing benchmarks for automated data science often emphasize snippet-level coding or high-level analytics. We introduce DataGovBench, a benchmark featuring 150 diverse tasks grounded in real-world scenarios, built on data from actual cases.
arXiv Detail & Related papers (2025-12-04T03:25:12Z) - UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation [70.2215233759276]
UtilGen is a novel utility-centric data augmentation framework for computer vision tasks. We show that UtilGen consistently achieves superior datasets, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data.
arXiv Detail & Related papers (2025-10-28T10:17:11Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - Autonomous Data Agents: A New Opportunity for Smart Data [50.02229219403014]
This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend.
arXiv Detail & Related papers (2025-09-23T06:46:41Z) - Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published at the NeurIPS track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing.
Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes.
These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z) - Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - A Theoretical Framework for AI-driven data quality monitoring in high-volume data environments [1.2753215270475886]
This paper presents a theoretical framework for an AI-driven data quality monitoring system designed to address the challenges of maintaining data quality in high-volume environments.
We examine the limitations of traditional methods in managing the scale, velocity, and variety of big data and propose a conceptual approach leveraging advanced machine learning techniques.
Key components include an intelligent data ingestion layer, adaptive preprocessing mechanisms, context-aware feature extraction, and AI-based quality assessment modules.
arXiv Detail & Related papers (2024-10-11T07:06:36Z) - Blockchain-Enabled Accountability in Data Supply Chain: A Data Bill of Materials Approach [16.31469678670097]
We introduce the "Data Bill of Materials" (DataBOM) to capture the dependency relationship between different datasets and stakeholders by storing specific metadata.
We demonstrate a platform architecture for providing blockchain-based DataBOM services, present the interaction protocol for stakeholders, and discuss the minimal requirements for DataBOM metadata.
arXiv Detail & Related papers (2024-08-16T05:34:50Z) - DataGen: Unified Synthetic Dataset Generation via Large Language Models [88.16197692794707]
DataGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets. To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature. Extensive experiments demonstrate the superior quality of data generated by DataGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z) - Efficient Data Collection for Robotic Manipulation via Compositional Generalization [70.76782930312746]
We show that policies can compose environmental factors from their data to succeed when encountering unseen factor combinations.
We propose better in-domain data collection strategies that exploit composition.
We provide videos at http://iliad.stanford.edu/robot-data-comp/.
arXiv Detail & Related papers (2024-03-08T07:15:38Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - Transforming Agriculture with Intelligent Data Management and Insights [3.027257459810039]
Modern agriculture faces grand challenges to meet increased demands for food, fuel, feed, and fiber under the constraints of climate change and dwindling natural resources.
Data innovation is urgently required to secure and improve the productivity, sustainability, and resilience of our agroecosystems.
arXiv Detail & Related papers (2023-11-07T22:02:54Z) - Robot Fleet Learning via Policy Merging [58.5086287737653]
We propose FLEET-MERGE to efficiently merge policies in the fleet setting.
We show that FLEET-MERGE consolidates the behavior of policies trained on 50 tasks in the Meta-World environment.
We introduce a novel robotic tool-use benchmark, FLEET-TOOLS, for fleet policy learning in compositional and contact-rich robot manipulation tasks.
arXiv Detail & Related papers (2023-10-02T17:23:51Z) - 1st ICLR International Workshop on Privacy, Accountability,
Interpretability, Robustness, Reasoning on Structured Data (PAIR^2Struct) [28.549151517783287]
Data Privacy, Accountability, Interpretability, Robustness, and Reasoning have been recognized as fundamental principles of using machine learning (ML) technologies on decision-critical and/or privacy-sensitive applications.
By exploiting the inherently structured knowledge, one can design plausible approaches to identify and use more relevant variables to make reliable decisions.
arXiv Detail & Related papers (2022-10-07T15:12:03Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - CateCom: a practical data-centric approach to categorization of
computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.