Elements of effective machine learning datasets in astronomy
- URL: http://arxiv.org/abs/2211.14401v2
- Date: Tue, 29 Nov 2022 06:25:23 GMT
- Title: Elements of effective machine learning datasets in astronomy
- Authors: Bernie Boscoe, Tuan Do, Evan Jones, Yunqi Li, Kevin Alfaro, Christy Ma
- Abstract summary: We identify elements of effective machine learning datasets in astronomy.
We discuss why these elements are important for astronomical applications and ways to put them in practice.
- Score: 1.552171919003135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we identify elements of effective machine learning datasets in
astronomy and present suggestions for their design and creation. Machine
learning has become an increasingly important tool for analyzing and
understanding the large-scale flood of data in astronomy. To take advantage of
these tools, datasets are required for training and testing. However, building
machine learning datasets for astronomy can be challenging. Astronomical data
is collected from instruments built to explore science questions in a
traditional fashion rather than to conduct machine learning. Thus, it is often
the case that raw data, or even downstream processed data, is not in a form
amenable to machine learning. We explore the construction of machine learning
datasets and we ask: what elements define effective machine learning datasets?
We define effective machine learning datasets in astronomy to be formed with
well-defined data points, structure, and metadata. We discuss why these
elements are important for astronomical applications and ways to put them in
practice. We posit that these qualities not only make the data suitable for
machine learning, they also help to foster usable, reusable, and replicable
science practices.
Related papers
- Reinforcement learning [0.8702432681310399]
Reinforcement learning is a mechanism where we (as humans and astronomers) can teach agents of artificial intelligence to perform some of these tedious tasks.
In this paper, we will present a state of the art overview of reinforcement learning and how it can benefit astronomy.
arXiv Detail & Related papers (2024-05-16T18:03:17Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Machine learning in solar physics [0.0]
The application of machine learning in solar physics has the potential to greatly enhance our understanding of the complex processes that take place in the atmosphere of the Sun.
By using techniques such as deep learning, we are now in the position to analyze large amounts of data from solar observations.
This can help us improve our understanding of explosive events like solar flares, which can have a strong effect on the Earth environment.
arXiv Detail & Related papers (2023-06-27T08:55:20Z)
- Position Paper on Dataset Engineering to Accelerate Science [1.952708415083428]
In this work, we will use the token "dataset" to designate a structured set of data built to perform a well-defined task.
Specifically, in science, each area has unique forms to organize, gather and handle its datasets.
We advocate that science and engineering discovery processes are extreme instances of the need for such organization on datasets.
arXiv Detail & Related papers (2023-03-09T19:07:40Z)
- Satellite Image Time Series Analysis for Big Earth Observation Data [50.591267188664666]
This paper describes sits, an open-source R package for satellite image time series analysis using machine learning.
We show that this approach produces high accuracy for land use and land cover maps through a case study in the Cerrado biome.
arXiv Detail & Related papers (2022-04-24T15:23:25Z)
- Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z)
- A Spacecraft Dataset for Detection, Segmentation and Parts Recognition [42.27081423489484]
In this paper, we release a dataset for spacecraft detection, instance segmentation and part recognition.
The main contribution of this work is the development of the dataset using images of space stations and satellites.
We also provide evaluations with state-of-the-art methods in object detection and instance segmentation as a benchmark for the dataset.
arXiv Detail & Related papers (2021-06-15T14:36:56Z)
- First Full-Event Reconstruction from Imaging Atmospheric Cherenkov Telescope Real Data with Deep Learning [55.41644538483948]
The Cherenkov Telescope Array is the future of ground-based gamma-ray astronomy.
Its first prototype telescope built on-site, the Large Size Telescope 1, is currently under commissioning and taking its first scientific data.
We present for the first time the development of a full-event reconstruction based on deep convolutional neural networks and its application to real data.
arXiv Detail & Related papers (2021-05-31T12:51:42Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models for the generation of as many data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
- Incorporating Physical Knowledge into Machine Learning for Planetary Space Physics [0.0]
We build off a previous effort applying a semi-supervised physics-based classification of plasma instabilities in Saturn's magnetosphere.
We show that incorporating knowledge of these orbiting spacecraft data characteristics improves the performance and interpretability of machine learning methods.
arXiv Detail & Related papers (2020-06-02T20:31:29Z)