DataFed: Towards Reproducible Research via Federated Data Management
- URL: http://arxiv.org/abs/2004.03710v1
- Date: Tue, 7 Apr 2020 21:05:22 GMT
- Authors: Dale Stansberry, Suhas Somnath, Jessica Breet, Gregory Shutt, and Mallikarjun Shankar
- Abstract summary: DataFed is a lightweight, distributed scientific data management system.
It spans a federation of storage systems within a loosely-coupled network of scientific facilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasingly collaborative, globalized nature of scientific research
combined with the need to share data and the explosion in data volumes present
an urgent need for a scientific data management system (SDMS). An SDMS presents
a logical and holistic view of data that greatly simplifies and empowers data
organization, curation, searching, sharing, dissemination, etc. We present
DataFed -- a lightweight, distributed SDMS that spans a federation of storage
systems within a loosely-coupled network of scientific facilities. Unlike
existing SDMS offerings, DataFed uses high-performance and scalable user
management and data transfer technologies that simplify deployment,
maintenance, and expansion of DataFed. DataFed provides web-based and
command-line interfaces to manage data and integrate with complex scientific
workflows. DataFed represents a step towards reproducible scientific research
by enabling reliable staging of the correct data at the desired environment.
Related papers
- A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published at the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing.
Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes.
These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z)
- Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z)
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- Data-driven Discovery with Large Generative Models [47.324203863823335]
This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs).
We demonstrate how LGMs fulfill several desiderata for an ideal data-driven discovery system.
We advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms.
arXiv Detail & Related papers (2024-02-21T08:26:43Z)
- Transforming Agriculture with Intelligent Data Management and Insights [3.027257459810039]
Modern agriculture faces grand challenges to meet increased demands for food, fuel, feed, and fiber under the constraints of climate change and dwindling natural resources.
Data innovation is urgently required to secure and improve the productivity, sustainability, and resilience of our agroecosystems.
arXiv Detail & Related papers (2023-11-07T22:02:54Z)
- A Versatile Data Fabric for Advanced IoT-Based Remote Health Monitoring [0.8789651809819904]
This paper presents a data-centric and security-focused data fabric designed for digital health applications.
The proposed data fabric comprises an architecture and a toolkit that facilitate the integration of heterogeneous data sources.
We present the implementation of our data fabric in a home-based telemonitoring research project involving older adults.
arXiv Detail & Related papers (2023-10-02T22:05:48Z)
- Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
- Outsourcing Training without Uploading Data via Efficient Collaborative Open-Source Sampling [49.87637449243698]
Traditional outsourcing requires uploading device data to the cloud server.
We propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources.
We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
arXiv Detail & Related papers (2022-10-23T00:12:18Z)
- A big data intelligence marketplace and secure analytics experimentation platform for the aviation industry [0.0]
This paper introduces the ICARUS big data-enabled platform that offers a novel aviation data and intelligence marketplace.
It holistically handles the complete big data lifecycle from the data collection, data curation and data exploration to the data integration and data analysis.
arXiv Detail & Related papers (2021-11-18T18:51:40Z)
- RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning [17.87592413742589]
RLDS is an ecosystem for recording, replaying, manipulating, annotating and sharing data in the context of Sequential Decision Making (SDM).
RLDS not only enables reproducibility of existing research and easy generation of new datasets, but also accelerates novel research.
The RLDS ecosystem makes it easy to share datasets without any loss of information and to be agnostic to the underlying original format.
arXiv Detail & Related papers (2021-11-04T11:48:19Z)
- Data Mining with Big Data in Intrusion Detection Systems: A Systematic Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation have begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting have, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.