AutoFAIR : Automatic Data FAIRification via Machine Reading
- URL: http://arxiv.org/abs/2408.04673v1
- Date: Wed, 7 Aug 2024 17:36:58 GMT
- Title: AutoFAIR : Automatic Data FAIRification via Machine Reading
- Authors: Tingyan Ma, Wei Liu, Bin Lu, Xiaoying Gan, Yunqiang Zhu, Luoyi Fu, Chenghu Zhou,
- Abstract summary: We propose AutoFAIR, an architecture designed to enhance data FAIRness automately.
We align each data and metadata operation with specific FAIR indicators to guide machine-executable actions.
We observe significant improvements in findability, accessibility, interoperability, and reusability of data.
- Score: 28.683653852643015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lack efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automately. Firstly, We align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, We utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles by ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z) - FAIR evaluation of ten widely used chemical datasets: Lessons learned and recommendations [0.0]
This document focuses on databases disseminating data on (hazardous) substances found on the North American and the European (EU) market.
The goal is to analyse the FAIRness of published open data on these substances.
We implement two complementary approaches: Manual, and Automatic.
arXiv Detail & Related papers (2024-07-22T12:26:41Z) - Large Language Models for Automated Data Science: Introducing CAAFE for
Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
arXiv Detail & Related papers (2023-05-05T09:58:40Z) - Demonstration of InsightPilot: An LLM-Empowered Automated Data
Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z) - FETA: Towards Specializing Foundation Models for Expert Task
Applications [49.57393504125937]
Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization.
We show in this paper that FMs still have poor out-of-the-box performance on expert tasks.
We propose a first of its kind FETA benchmark built around the task of teaching FMs to understand technical documentation.
arXiv Detail & Related papers (2022-09-08T08:47:57Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI [19.358073575300004]
We propose a data-centric approach to address the fundamental distributional and semantic properties of dataset with black box models.
We achieve 84.711% test accuracy (ranked #6, Honorable Mention in the Most Innovative) in the Data-Centric AI competition only with the provided dataset.
arXiv Detail & Related papers (2021-12-07T17:22:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.