DataFinder: Scientific Dataset Recommendation from Natural Language
Descriptions
- URL: http://arxiv.org/abs/2305.16636v2
- Date: Wed, 7 Jun 2023 03:08:27 GMT
- Title: DataFinder: Scientific Dataset Recommendation from Natural Language
Descriptions
- Authors: Vijay Viswanathan, Luyu Gao, Tongshuang Wu, Pengfei Liu and Graham
Neubig
- Abstract summary: We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset, which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
A bi-encoder retriever trained on the DataFinder dataset finds more relevant search results than existing third-party dataset search engines.
- Score: 100.52917027038369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern machine learning relies on datasets to develop and validate research
ideas. Given the growth of publicly available data, finding the right dataset
to use is increasingly difficult. Any research question imposes explicit and
implicit constraints on how well a given dataset will enable researchers to
answer this question, such as dataset size, modality, and domain. We
operationalize the task of recommending datasets given a short natural language
description of a research idea, to help people find relevant datasets for their
needs. Dataset recommendation poses unique challenges as an information
retrieval problem; datasets are hard to directly index for search and there are
no corpora readily available for this task. To facilitate this task, we build
the DataFinder Dataset which consists of a larger automatically-constructed
training set (17.5K queries) and a smaller expert-annotated evaluation set (392
queries). Using this data, we compare various information retrieval algorithms
on our test set and present a superior bi-encoder retriever for text-based
dataset recommendation. This system, trained on the DataFinder Dataset, finds
more relevant search results than existing third-party dataset search engines.
To encourage progress on dataset recommendation, we release our dataset and
models to the public.
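The abstract describes a bi-encoder retriever: queries and dataset descriptions are encoded independently, and datasets are ranked by the similarity of the two vectors. The sketch below illustrates only that scoring structure. It is not the authors' model: a toy bag-of-words encoder stands in for the learned neural encoders, and the function names (`encode`, `score`, `recommend`) and the three dataset descriptions are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy stand-in for a learned encoder: L2-normalized bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def score(query_vec, doc_vec):
    """Inner product of two sparse vectors (cosine similarity, since both are normalized)."""
    return sum(w * doc_vec.get(tok, 0.0) for tok, w in query_vec.items())


def recommend(query, dataset_descriptions, k=2):
    """Rank datasets by similarity between the query and each description."""
    # Descriptions are encoded independently of the query, so in a real
    # system this index could be built once and reused for every query.
    index = {name: encode(desc) for name, desc in dataset_descriptions.items()}
    q = encode(query)
    ranked = sorted(index, key=lambda name: score(q, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical one-line descriptions, written for this example only.
DATASETS = {
    "SQuAD": "reading comprehension question answering over Wikipedia articles",
    "ImageNet": "large scale image classification with object categories",
    "LibriSpeech": "speech recognition corpus of read English audiobooks",
}

top = recommend(
    "I want to train a question answering model on Wikipedia text",
    DATASETS,
    k=1,
)
print(top)
```

Because the query and the descriptions never interact until the final inner product, swapping the toy `encode` for a trained text encoder would leave `score` and `recommend` unchanged.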
Related papers
- Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery.
The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking [0.9944647907864256]
The imitation learning field requires expert data to train agents on a task.
Most often, this learning approach suffers from the absence of available data.
This work aims to address these issues by creating Imitation Learning datasets.
arXiv Detail & Related papers (2024-03-01T14:18:46Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform.
We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision.
We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z)
- Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification [39.01740345482624]
In this paper, we ask the research question of whether all the datasets in the benchmark are necessary.
Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while those less used datasets exhibit impressive discriminative power.
arXiv Detail & Related papers (2022-05-04T15:33:00Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- Simplified Data Wrangling with ir_datasets [37.558383796758356]
ir_datasets is a tool for acquiring, managing, and performing typical operations over datasets used in Information Retrieval (IR) experiments.
This tool provides both a Python and a command-line interface to numerous IR datasets and benchmarks.
arXiv Detail & Related papers (2021-03-03T09:38:36Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)