Related papers: DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions

URL: http://arxiv.org/abs/2305.16636v2
Date: Wed, 7 Jun 2023 03:08:27 GMT
Title: DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
Authors: Vijay Viswanathan, Luyu Gao, Tongshuang Wu, Pengfei Liu and Graham Neubig
Abstract summary: We operationalize the task of recommending datasets given a short natural language description. To facilitate this task, we build the DataFinder dataset, which consists of a larger automatically constructed training set and a smaller expert-annotated evaluation set. A bi-encoder retriever trained on the DataFinder dataset finds more relevant search results than existing third-party dataset search engines.
Score: 100.52917027038369
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer this question, such as dataset size, modality, and domain. We operationalize the task of recommending datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem; datasets are hard to directly index for search and there are no corpora readily available for this task. To facilitate this task, we build the DataFinder Dataset which consists of a larger automatically-constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present a superior bi-encoder retriever for text-based dataset recommendation. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public.
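
To make the retrieval setup described in the abstract concrete, the following is a minimal sketch of bi-encoder-style dataset recommendation: the query and each dataset description are encoded independently, and datasets are ranked by vector similarity. It is not the authors' model. The encoder here is a toy bag-of-words stand-in for a trained neural encoder, and the dataset names, descriptions, and function names (`encode`, `score`, `recommend`) are hypothetical.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy encoder (assumption): L2-normalised bag-of-words vector.

    A real bi-encoder would use a trained neural text encoder here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {token: c / norm for token, c in counts.items()}


def score(query_vec, doc_vec):
    """Dot product of two sparse unit vectors, i.e. cosine similarity."""
    return sum(w * doc_vec.get(token, 0.0) for token, w in query_vec.items())


def recommend(query, dataset_descriptions, k=3):
    """Rank datasets by similarity between the query and each description."""
    # Descriptions are encoded independently of the query, so this index
    # could be precomputed once and reused for every incoming query.
    index = {name: encode(desc) for name, desc in dataset_descriptions.items()}
    query_vec = encode(query)
    ranked = sorted(index, key=lambda name: score(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical dataset descriptions, for illustration only.
DATASETS = {
    "toy-qa": "question answering over wikipedia paragraphs in english",
    "toy-images": "image classification of natural photographs with object labels",
    "toy-speech": "speech recognition audio recordings with english transcripts",
}

if __name__ == "__main__":
    idea = "reading comprehension question answering on wikipedia text"
    print(recommend(idea, DATASETS, k=2))
```

Because the two sides are encoded separately, the dataset vectors can be built offline, which is the usual reason to prefer a bi-encoder over scoring each query and description pair jointly.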

Related papers

Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System [8.096082871461311]
Pneuma is a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search.
arXiv Detail & Related papers (2025-04-12T13:20:50Z)
Making Sense of Data in the Wild: Data Analysis Automation at Scale [0.1747623282473278]
We propose a novel approach that combines intelligent agents with retrieval augmented generation to automate data analysis, dataset curation and indexing at scale. We demonstrate that our approach results in more detailed dataset descriptions, higher hit rates and greater diversity in dataset retrieval tasks.
arXiv Detail & Related papers (2025-01-27T10:04:10Z)
Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery. The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z)
Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation. On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking [0.9944647907864256]
The imitation learning field requires expert data to train agents on a task. Most often, this learning approach suffers from a lack of available data. This work aims to address these issues by creating Imitation Learning datasets.
arXiv Detail & Related papers (2024-03-01T14:18:46Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z)
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification [39.01740345482624]
In this paper, we ask whether all the datasets in a benchmark are necessary. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while less-used datasets exhibit impressive discriminative power.
arXiv Detail & Related papers (2022-05-04T15:33:00Z)
DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data. DataLab has features for dataset recommendation and global vision analysis. So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
Simplified Data Wrangling with ir_datasets [37.558383796758356]
ir_datasets is a tool for acquiring, managing, and performing typical operations over datasets used in Information Retrieval (IR) experiments. This tool provides both a Python and a command-line interface to numerous IR datasets and benchmarks.
arXiv Detail & Related papers (2021-03-03T09:38:36Z)
DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network. We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples. We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
