Simplified Data Wrangling with ir_datasets
- URL: http://arxiv.org/abs/2103.02280v1
- Date: Wed, 3 Mar 2021 09:38:36 GMT
- Title: Simplified Data Wrangling with ir_datasets
- Authors: Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman
Cohan, Nazli Goharian
- Abstract summary: ir_datasets is a tool for acquiring, managing, and performing typical operations over datasets used in Information Retrieval (IR) experiments.
This tool provides both a Python and a command-line interface to numerous IR datasets and benchmarks.
- Score: 37.558383796758356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Managing the data for Information Retrieval (IR) experiments can be
challenging. Dataset documentation is scattered across the Internet and once
one obtains a copy of the data, there are numerous different data formats to
work with. Even basic formats can have subtle dataset-specific nuances that
need to be considered for proper use. To help mitigate these challenges, we
introduce a new robust and lightweight tool (ir_datasets) for acquiring,
managing, and performing typical operations over datasets used in IR. We
primarily focus on textual datasets used for ad-hoc search. This tool provides
both a Python and command line interface to numerous IR datasets and
benchmarks. To our knowledge, this is the most extensive tool of its kind.
Integrations with popular IR indexing and experimentation toolkits demonstrate
the tool's utility. We also provide documentation of these datasets through the
ir_datasets catalog: https://ir-datasets.com/. The catalog acts as a hub for
information on datasets used in IR, providing core information about what data
each benchmark provides as well as links to more detailed information. We
welcome community contributions and intend to continue to maintain and grow
this tool.
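To make the point about format nuances concrete, here is a minimal sketch of reading relevance judgments in the common four-column TREC qrels layout (query ID, iteration, document ID, relevance). It is not code from ir_datasets; the `Qrel` record, the `parse_trec_qrels` helper, and the sample lines are all hypothetical and exist only for illustration. Note that the column order in the file differs from the order most code wants to work with, which is the kind of detail such a tool can hide.

```python
from typing import Iterable, Iterator, NamedTuple


class Qrel(NamedTuple):
    """One relevance judgment; the field names are illustrative choices."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def parse_trec_qrels(lines: Iterable[str]) -> Iterator[Qrel]:
    """Yield Qrel records from whitespace-separated lines laid out as
    query_id, iteration, doc_id, relevance (assumed four-column format)."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 columns, got {len(parts)}"
            )
        query_id, iteration, doc_id, relevance = parts
        # Reorder the fields so the iteration column comes last.
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Hypothetical sample data, made up for this sketch.
sample = [
    "301 0 FBIS3-10082 1",
    "301 0 FBIS3-10169 0",
    "",
    "302 0 FBIS3-10243 -1",
]

qrels = list(parse_trec_qrels(sample))
relevant = [q.doc_id for q in qrels if q.relevance > 0]
print(relevant)
```

Treating only positive labels as relevant is itself an assumption; some collections use negative or graded labels with dataset-specific meanings, so the threshold should be checked against each dataset's documentation.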
Related papers
- Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- Ad-datasets: a meta-collection of data sets for autonomous driving [5.317624228510748]
ad-datasets is an online tool that provides an overview of more than 150 data sets.
It enables users to sort and filter the data sets according to 16 different categories.
arXiv Detail & Related papers (2022-02-03T23:45:48Z)
- MusPy: A Toolkit for Symbolic Music Generation [32.01713268702699]
MusPy is an open source Python library for symbolic music generation.
In this paper, we present statistical analysis of the eleven datasets currently supported by MusPy.
arXiv Detail & Related papers (2020-08-05T06:16:13Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
- Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.