DataLab: A Platform for Data Analysis and Intervention
- URL: http://arxiv.org/abs/2202.12875v1
- Date: Fri, 25 Feb 2022 18:32:19 GMT
- Title: DataLab: A Platform for Data Analysis and Intervention
- Authors: Yang Xiao, Jinlan Fu, Weizhe Yuan, Vijay Viswanathan, Zhoumianze Liu,
Yixin Liu, Graham Neubig and Pengfei Liu
- Abstract summary: DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
- Score: 96.75253335629534
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite data's crucial role in machine learning, most existing tools and
research tend to focus on systems on top of existing data rather than how to
interpret and manipulate data. In this paper, we propose DataLab, a unified
data-oriented platform that not only allows users to interactively analyze the
characteristics of data, but also provides a standardized interface for
different data processing operations. Additionally, in view of the ongoing
proliferation of datasets, DataLab has features for dataset recommendation
and global vision analysis that help researchers form a better view of the data
ecosystem. So far, DataLab covers 1,715 datasets and 3,583 transformed
versions of them (e.g., hyponym replacement), where 728 datasets support various
analyses (e.g., with respect to gender bias) with the help of 140M samples
annotated by 318 feature functions. DataLab is under active development and
will be supported going forward. We have released a web platform, web API,
Python SDK, PyPI package, and online documentation, which we hope
can meet the diverse needs of researchers.
Related papers
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z)
- Data+Shift: Supporting visual investigation of data distribution shifts by data scientists [1.6311150636417262]
Data+Shift is a visual analytics tool to support data scientists in the task of investigating the underlying factors of shift in data features.
We validated our approach with a think-aloud experiment where a data scientist used the tool for a fraud detection use case.
arXiv Detail & Related papers (2022-04-29T11:50:25Z)
- Simplified Data Wrangling with ir_datasets [37.558383796758356]
ir_datasets is a tool for acquiring, managing, and performing typical operations over datasets used in Information Retrieval (IR) experiments.
This tool provides both a Python and a command-line interface to numerous IR datasets and benchmarks.
arXiv Detail & Related papers (2021-03-03T09:38:36Z)
- MusPy: A Toolkit for Symbolic Music Generation [32.01713268702699]
MusPy is an open source Python library for symbolic music generation.
In this paper, we present statistical analysis of the eleven datasets currently supported by MusPy.
arXiv Detail & Related papers (2020-08-05T06:16:13Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
- PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning [55.32009000204512]
We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support.
Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space.
It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
arXiv Detail & Related papers (2020-03-12T03:30:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.