A compendium of data sources for data science, machine learning, and
artificial intelligence
- URL: http://arxiv.org/abs/2309.05682v1
- Date: Sun, 10 Sep 2023 19:15:22 GMT
- Title: A compendium of data sources for data science, machine learning, and
artificial intelligence
- Authors: Paul Bilokon and Oleksandr Bilokon and Saeed Amen
- Abstract summary: Recent advances in data science, machine learning, and artificial intelligence are leading to an increasing demand for data.
Data sources are application-specific, and it is impossible to produce an exhaustive list of such data sources.
The goal of this publication is to provide a comprehensive (though inevitably incomplete) list -- or compendium -- of data sources across multiple areas of applications.
- Score: 17.857341127079305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in data science, machine learning, and artificial
intelligence, such as the emergence of large language models, are leading to an
increasing demand for data that can be processed by such models. While data
sources are application-specific, and it is impossible to produce an exhaustive
list of such data sources, it seems that a comprehensive, rather than complete,
list would still benefit data scientists and machine learning experts of all
levels of seniority. The goal of this publication is to provide just such an
(inevitably incomplete) list -- or compendium -- of data sources across
multiple areas of applications, including finance and economics, legal (laws
and regulations), life sciences (medicine and drug discovery), news sentiment
and social media, retail and ecommerce, satellite imagery, shipping and
logistics, and sports.
Related papers
- DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z)
- Research on the Spatial Data Intelligent Foundation Model [70.47828328840912]
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models.
It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models.
The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios.
arXiv Detail & Related papers (2024-05-30T06:21:34Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Book Chapter in Computational Demography and Health [0.0]
Computational demography, big data, and precision health research includes social scientists, physical scientists, engineers, data scientists, and disease experts.
This work has changed how we use administrative data and conduct surveys, and it allows for complex behavioral studies via big data.
This chapter reviews this emerging field's new data sources, methods, and applications.
arXiv Detail & Related papers (2023-09-08T17:30:33Z)
- Data-centric Artificial Intelligence: A Survey [47.24049907785989]
Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI.
In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals.
We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle.
arXiv Detail & Related papers (2023-03-17T17:44:56Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Computational Skills by Stealth in Secondary School Data Science [16.960800464621993]
We discuss a proposal for the stealth development of computational skills in students' first exposure to data science.
The intent of this approach is to support students, regardless of interest and self-efficacy in coding, in becoming data-driven learners.
arXiv Detail & Related papers (2020-10-08T09:11:51Z)
- A Survey on Data Pricing: from Economics to Data Science [61.72030615854597]
We examine various motivations behind data pricing and understand the economics of data pricing.
We discuss both digital products and data products.
We consider a series of challenges and directions for future work.
arXiv Detail & Related papers (2020-09-09T19:31:38Z)
- A fresh look at introductory data science [0.0]
We present a case study of an introductory undergraduate course in data science that is designed to address these needs.
This course has no pre-requisites and serves a wide audience of aspiring statistics and data science majors as well as humanities, social sciences, and natural sciences students.
We discuss the unique set of challenges posed by offering such a course and, in light of these challenges, present a detailed discussion of the pedagogical design elements, content, structure, computational infrastructure, and assessment methodology of the course.
arXiv Detail & Related papers (2020-08-01T18:39:34Z)
- Data Science: A Comprehensive Overview [42.98602883069444]
The twenty-first century has ushered in the age of big data and data economy, in which data DNA has become an intrinsic constituent of all data-based organisms.
An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics.
This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons and thinking about data science and analytics.
arXiv Detail & Related papers (2020-07-01T02:33:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.