Related papers: Blockchain-Enabled Accountability in Data Supply Chain: A Data Bill of Materials Approach

URL: http://arxiv.org/abs/2408.08536v1
Date: Fri, 16 Aug 2024 05:34:50 GMT
Title: Blockchain-Enabled Accountability in Data Supply Chain: A Data Bill of Materials Approach
Authors: Yue Liu, Dawen Zhang, Boming Xia, Julia Anticev, Tunde Adebayo, Zhenchang Xing, Moses Machao,
Abstract summary: We introduce "Data Bill of Materials" (DataBOM) to capture the dependency relationship between different datasets and stakeholders by storing specific metadata. We demonstrate a platform architecture for providing blockchain-based DataBOM services, present the interaction protocol for stakeholders, and discuss the minimal requirements for DataBOM metadata.
Score: 16.31469678670097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the era of advanced artificial intelligence, highlighted by large-scale generative models like GPT-4, ensuring the traceability, verifiability, and reproducibility of datasets throughout their lifecycle is paramount for research institutions and technology companies. These organisations increasingly rely on vast corpora to train and fine-tune advanced AI models, resulting in intricate data supply chains that demand effective data governance mechanisms. In addition, the challenge intensifies as diverse stakeholders may use assorted tools, often without adequate measures to ensure the accountability of data and the reliability of outcomes. In this study, we adapt the concept of "Software Bill of Materials" into the field of data governance and management to address the above challenges, and introduce "Data Bill of Materials" (DataBOM) to capture the dependency relationship between different datasets and stakeholders by storing specific metadata. We demonstrate a platform architecture for providing blockchain-based DataBOM services, present the interaction protocol for stakeholders, and discuss the minimal requirements for DataBOM metadata. The proposed solution is evaluated in terms of feasibility and performance via case study and quantitative analysis respectively.
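
The abstract describes DataBOM only at a high level, so the following is a minimal illustrative sketch and not the paper's implementation. It assumes a hypothetical record schema (`dataset_id`, `stakeholder`, `content_hash`, `derived_from`) and substitutes a toy in-memory hash chain (`Ledger`) for a real blockchain, to show how stored metadata could capture dataset dependencies and make tampering detectable.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DataBOMRecord:
    # All field names are hypothetical; the paper's metadata schema may differ.
    dataset_id: str
    stakeholder: str
    content_hash: str          # digest of the dataset contents
    derived_from: tuple = ()   # dataset_ids of upstream datasets


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Ledger:
    """Toy append-only, hash-chained log standing in for a blockchain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, record: DataBOMRecord) -> str:
        # Each block commits to the previous block's hash and the record payload.
        prev_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        payload = json.dumps(asdict(record), sort_keys=True)
        block_hash = sha256_hex((prev_hash + payload).encode())
        self.blocks.append({"prev": prev_hash, "payload": payload, "hash": block_hash})
        return block_hash

    def verify(self) -> bool:
        # Recompute every hash; any edited payload breaks the chain.
        prev_hash = self.GENESIS
        for block in self.blocks:
            expected = sha256_hex((prev_hash + block["payload"]).encode())
            if block["prev"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True

    def lineage(self, dataset_id: str) -> list:
        # Walk derived_from links to list a dataset and all of its ancestors.
        records = {}
        for block in self.blocks:
            rec = json.loads(block["payload"])
            records[rec["dataset_id"]] = rec
        seen, stack = [], [dataset_id]
        while stack:
            current = stack.pop()
            if current in seen or current not in records:
                continue
            seen.append(current)
            stack.extend(records[current]["derived_from"])
        return seen


# Example data supply chain: raw corpus -> cleaned corpus -> training set.
ledger = Ledger()
ledger.append(DataBOMRecord("raw", "alice", sha256_hex(b"raw bytes")))
ledger.append(DataBOMRecord("cleaned", "bob", sha256_hex(b"clean bytes"), ("raw",)))
ledger.append(DataBOMRecord("train", "carol", sha256_hex(b"train bytes"), ("cleaned",)))
```

In this sketch, `ledger.lineage("train")` walks back through the upstream datasets, and `ledger.verify()` returns `False` once any stored payload is altered. A production system would rely on the consensus and access control of an actual blockchain platform, which this toy chain does not model.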

Related papers

Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures [27.473508984130728]
Large-scale scientific collaborations generate petabytes of data, with volumes soon expected to reach exabytes. To manage these computational and storage demands, centralized workflow and data management systems are implemented. A significant obstacle in adopting more effective or AI-driven solutions is the absence of a quick and reliable introspective dynamic model.
arXiv Detail & Related papers (2025-06-24T12:42:36Z)
Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
Towards Human-Guided, Data-Centric LLM Co-Pilots [53.35493881390917]
CliMB-DC is a human-guided, data-centric framework for machine learning co-pilots. It combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. We show how CliMB-DC can transform uncurated datasets into ML-ready formats.
arXiv Detail & Related papers (2025-01-17T17:51:22Z)
MetaTrading: An Immersion-Aware Model Trading Framework for Vehicular Metaverse Services [94.61039892220037]
We propose an immersion-aware model trading framework that facilitates data provision for services while ensuring privacy through federated learning (FL). We design an incentive mechanism to incentivize metaverse users (MUs) to contribute high-value models under resource constraints. We develop a fully distributed dynamic reward algorithm based on deep reinforcement learning, without accessing any private information about MUs and other MSPs.
arXiv Detail & Related papers (2024-10-25T16:20:46Z)
A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models. We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities. Silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand. We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z)
A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples. Existing literature surveys only focus on a certain type of specific modality data. We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets. To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial. This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z)
Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets. We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Towards Avoiding the Data Mess: Industry Insights from Data Mesh Implementations [1.5029560229270191]
Data mesh is a socio-technical, decentralized, distributed concept for enterprise data management. We conduct 15 semi-structured interviews with industry experts. Our findings synthesize insights from industry experts and provide researchers and professionals with preliminary guidelines for the successful adoption of data mesh.
arXiv Detail & Related papers (2023-02-03T13:09:57Z)
Towards federated multivariate statistical process control (FedMSPC) [1.8047694351309207]
We propose a privacy-preserving, federated statistical process control (FedMSPC) framework based on Federated Principal Component Analysis (PCA) and Secure Multiparty Computation. Our empirical results demonstrate the superior fault detection capability of the proposed approach compared to standard, single-party (multiway) PCA.
arXiv Detail & Related papers (2022-11-03T08:36:30Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.