Datasets: A Community Library for Natural Language Processing
- URL: http://arxiv.org/abs/2109.02846v1
- Date: Tue, 7 Sep 2021 03:59:22 GMT
- Title: Datasets: A Community Library for Natural Language Processing
- Authors: Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek
Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame,
Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan
Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh,
Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain
Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman,
Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas,
Alexander M. Rush, and Thomas Wolf
- Abstract summary: Datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
- Score: 55.48866401721244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scale, variety, and quantity of publicly-available NLP datasets have grown
rapidly as researchers propose new tasks, larger models, and novel benchmarks.
Datasets is a community library for contemporary NLP designed to support this
ecosystem. Datasets aims to standardize end-user interfaces, versioning, and
documentation, while providing a lightweight front-end that behaves similarly
for small datasets as for internet-scale corpora. The design of the library
incorporates a distributed, community-driven approach to adding datasets and
documenting usage. After a year of development, the library now includes more
than 650 unique datasets, has more than 250 contributors, and has helped
support a variety of novel cross-dataset research projects and shared tasks.
The library is available at https://github.com/huggingface/datasets.
Related papers
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing instruction-tuning datasets are almost entirely in English.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- trajdata: A Unified Interface to Multiple Human Trajectory Datasets [32.93180256927027]
We present trajdata, a unified interface to multiple human trajectory datasets.
Trajdata provides a simple, uniform, and efficient representation and API for trajectory and map data.
arXiv Detail & Related papers (2023-07-26T02:45:59Z)
- Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning [11.205441416962284]
We introduce dataset grouper, a library to create large-scale group-structured datasets.
It enables federated learning simulation at the scale of foundation models.
arXiv Detail & Related papers (2023-07-18T20:27:45Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- SequeL: A Continual Learning Library in PyTorch and JAX [50.33956216274694]
SequeL is a library for Continual Learning that supports both PyTorch and JAX frameworks.
It provides a unified interface for a wide range of Continual Learning algorithms, including regularization-based approaches, replay-based approaches, and hybrid approaches.
We release SequeL as an open-source library, enabling researchers and developers to easily experiment and extend the library for their own purposes.
arXiv Detail & Related papers (2023-04-21T10:00:22Z)
- Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks [0.007696728525672149]
In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families.
Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages.
arXiv Detail & Related papers (2022-10-26T13:45:14Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
- A Large Dataset of Historical Japanese Documents with Complex Layouts [5.343406649012619]
HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations of seven types.
A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors.
arXiv Detail & Related papers (2020-04-18T18:38:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.