Packaging code for reproducible research in the public sector
- URL: http://arxiv.org/abs/2305.16205v1
- Date: Thu, 25 May 2023 16:07:24 GMT
- Title: Packaging code for reproducible research in the public sector
- Authors: Federico Botta, Robin Lovelace, Laura Gilbert, Arthur Turrell
- Abstract summary: The jtstats project consists of R and Python packages for importing, processing, and visualising large and complex datasets.
jtstats shows how domain-specific packages can enable reproducible research within the public sector and beyond.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The effective and ethical use of data to inform decision-making offers huge
value to the public sector, especially when delivered by transparent,
reproducible, and robust data processing workflows. One way that governments
are unlocking this value is through making their data publicly available,
allowing more people and organisations to derive insights. However, open data
is not enough in many cases: publicly available datasets need to be accessible
in an analysis-ready form from popular data science tools, such as R and
Python, for them to realise their full potential.
This paper explores ways to maximise the impact of open data with reference
to a case study of packaging code to facilitate reproducible analysis. We
present the jtstats project, which consists of R and Python packages for
importing, processing, and visualising large and complex datasets representing
journey times, for many modes and purposes at multiple geographic levels,
released by the UK Department for Transport. jtstats shows how domain-specific
packages can enable reproducible research within the public sector and beyond,
saving duplicated effort and reducing the risks of errors from repeated
analyses. We hope that the jtstats project inspires others, particularly those
in the public sector, to add value to their data sets by making them more
accessible.
Related papers
- Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset [1.3757956340051605]
We present a flexible and efficient end to end pipeline for working with the Dynamic World dataset.
This includes a pre-processing and representation framework which tackles noise removal, efficient extraction of large amounts of data, and re-representation of LULC data.
To demonstrate the power of our pipeline, we use it to extract data for an urbanization prediction problem and build a suite of machine learning models with excellent performance.
arXiv Detail & Related papers (2024-10-11T16:13:01Z)
- Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning [1.8270184406083445]
We explore using large language models (LLM) and prompting strategies to automatically extract dimensions from documents.
Our approach could aid data publishers and practitioners in creating machine-readable documentation.
We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
arXiv Detail & Related papers (2024-04-04T10:09:28Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Customs Import Declaration Datasets [12.306592823750385]
We introduce an import declaration dataset to facilitate the collaboration between domain experts in customs administrations and researchers from diverse domains.
The dataset contains 54,000 artificially generated trades with 22 key attributes.
We empirically show that more advanced algorithms can better detect fraud.
arXiv Detail & Related papers (2022-08-04T06:20:20Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
- Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation [10.135719343010178]
Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy.
There is, however, no real-world public dataset that enables the evaluation of OPE.
We present the Open Bandit Dataset, a public logged bandit dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN.
arXiv Detail & Related papers (2020-08-17T08:23:50Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.