Principles for Open Data Curation: A Case Study with the New York City 311 Service Request Data
- URL: http://arxiv.org/abs/2502.08649v1
- Date: Tue, 14 Jan 2025 12:06:20 GMT
- Title: Principles for Open Data Curation: A Case Study with the New York City 311 Service Request Data
- Authors: David Hussey, Jun Yan
- Abstract summary: The City of New York (NYC) has been at the forefront of the open data movement since the enactment of the Open Data Law in 2012.
The portal currently hosts 2,700 datasets, serving as a crucial resource for research across various domains.
The effective use of open data relies heavily on data quality and usability, challenges that remain insufficiently addressed in the literature.
- Score: 2.3464946883680864
- Abstract: In the early 21st century, the open data movement began to transform societies and governments by promoting transparency, innovation, and public engagement. The City of New York (NYC) has been at the forefront of this movement since the enactment of the Open Data Law in 2012, creating the NYC Open Data portal. The portal currently hosts 2,700 datasets, serving as a crucial resource for research across various domains, including health, urban development, and transportation. However, the effective use of open data relies heavily on data quality and usability, challenges that remain insufficiently addressed in the literature. This paper examines these challenges via a case study of the NYC 311 Service Request dataset, identifying key issues in data validity, consistency, and curation efficiency. We propose a set of data curation principles, tailored for government-released open data, to address these challenges. Our findings highlight the importance of harmonized field definitions, streamlined storage, and automated quality checks, offering practical guidelines for improving the reliability and utility of open datasets.
Related papers
- Differentially Private Data Release on Graphs: Inefficiencies and Unfairness [48.96399034594329]
This paper characterizes the impact of Differential Privacy on bias and unfairness in the context of releasing information about networks.
We consider a network release problem where the network structure is known to all, but the weights on edges must be released privately.
Our work provides theoretical foundations for, and empirical evidence of, the bias and unfairness that privacy introduces in these networked decision problems.
arXiv Detail & Related papers (2024-08-08T08:37:37Z)
- Future and AI-Ready Data Strategies: Response to DOC RFI on AI and Open Government Data Assets [6.659894897434807]
The following is a response to the US Department of Commerce's Request for Information (RFI) regarding AI and Open Government Data Assets.
We commend the Department for its initiative in seeking public insights on the organization and sharing of data.
In our response, we outline best practices and key considerations for AI and the Department of Commerce's Open Government Data Assets.
arXiv Detail & Related papers (2024-07-26T07:31:32Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and the scattering of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Lessons from the AdKDD'21 Privacy-Preserving ML Challenge [57.365745458033075]
A prominent proposal at W3C only allows sharing advertising signals through aggregated, differentially private reports of past displays.
To study this proposal extensively, an open Privacy-Preserving Machine Learning Challenge took place at AdKDD'21.
A key finding is that learning models on large, aggregated data in the presence of a small set of unaggregated data points can be surprisingly efficient and cheap.
arXiv Detail & Related papers (2022-01-31T11:09:59Z)
- Federated Learning for Big Data: A Survey on Opportunities, Applications, and Future Directions [5.124701758921822]
We present a survey on the use of federated learning for big data services and applications.
We review the use of FL for key big data services, including big data acquisition, big data storage, big data analytics, and big data privacy preservation.
arXiv Detail & Related papers (2021-10-08T14:36:43Z)
- Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use [0.4462475518267084]
CDC has collected person-level, de-identified data from jurisdictions and currently has over 8 million records.
Data elements were included based on usefulness, public requests, and privacy implications.
Specific field values were suppressed to reduce risk of reidentification and exposure of confidential information.
arXiv Detail & Related papers (2021-01-13T14:24:20Z)
- Open Data Quality Evaluation: A Comparative Analysis of Open Data in Latvia [0.0]
The research discusses how (open) data quality could be assessed.
One specific approach is applied to several Latvian open data sets.
Common data quality problems detected in Latvian open data and in the open data of 3 European countries are also highlighted.
arXiv Detail & Related papers (2020-07-09T10:43:28Z)