Training from Zero: Radio Frequency Machine Learning Data Quantity Forecasting
- URL: http://arxiv.org/abs/2205.03703v2
- Date: Fri, 14 Jun 2024 17:33:28 GMT
- Title: Training from Zero: Radio Frequency Machine Learning Data Quantity Forecasting
- Authors: William H. Clark IV, Alan J. Michaels
- Abstract summary: The data used during training in any given application space is directly tied to the performance of the system once deployed.
One of the underlying rules of thumb in the machine learning space is that more data leads to better models.
This work examines a modulation classification problem in the Radio Frequency domain.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The data used during training in any given application space is directly tied to the performance of the system once deployed. While many other factors go into producing high-performance models within machine learning, there is no doubt that the data used to train a system provides the foundation from which to build. One of the underlying rules of thumb in the machine learning space is that more data leads to better models, but there is no easy answer to the question, "How much data is needed?" This work examines a modulation classification problem in the Radio Frequency domain, attempting to answer how much training data is required to achieve a desired level of performance, though the procedure readily applies to classification problems across modalities. The ultimate goal is an approach that requires the least amount of data collection to better inform a more thorough collection effort that achieves the desired performance metric. While this approach requires an initial dataset germane to the problem space to act as a *target* dataset from which metrics are extracted, the goal is to allow the initial data to be orders of magnitude smaller than what is required to deliver a system that achieves the desired performance. An additional benefit of the techniques presented here is that the quality of different datasets can be numerically evaluated and tied to the quantity of data and, ultimately, to the performance of the architecture in the problem domain.
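The paper itself ships no code, but the kind of forecast it describes is commonly done by fitting a learning curve to metrics measured on small training subsets and extrapolating to the dataset size that reaches a target. The sketch below assumes a saturating power-law form acc(n) ≈ a − b·n^(−c) and made-up subset accuracies; it illustrates the general technique, not the authors' exact procedure.

```python
# A minimal sketch of data-quantity forecasting via learning-curve
# extrapolation. The power-law form and all numbers here are illustrative
# assumptions, not the procedure from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Common learning-curve model: accuracy saturates at `a` as n grows."""
    return a - b * np.power(n, -c)

# Accuracies measured by training the classifier on small subsets
# (hypothetical numbers for illustration).
subset_sizes = np.array([500, 1000, 2000, 4000, 8000], dtype=float)
accuracies   = np.array([0.52, 0.61, 0.69, 0.75, 0.80])

params, _ = curve_fit(power_law, subset_sizes, accuracies,
                      p0=[0.95, 5.0, 0.5], maxfev=10000)
a, b, c = params

target = 0.90
if target >= a:
    print(f"Target {target:.2f} exceeds the fitted ceiling {a:.3f}; "
          "more data alone may not reach it.")
else:
    # Invert acc = a - b * n^-c to solve for the required n.
    n_required = (b / (a - target)) ** (1.0 / c)
    print(f"Estimated training examples for {target:.0%} accuracy: "
          f"{n_required:,.0f}")
```

As the abstract suggests, the initial subsets can be orders of magnitude smaller than the final collection, but extrapolations far beyond the measured range should be treated as rough estimates.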
Related papers
- An information-matching approach to optimal experimental design and active learning [0.9362620873652918]
We introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool.
We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics.
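The summary does not include the authors' formulation, but the core idea can be sketched for a toy linear-Gaussian model, where each candidate input x contributes Fisher information x xᵀ and selection proceeds greedily until the selected information dominates a target matrix. The greedy criterion and all names below are illustrative assumptions.

```python
# Toy sketch of information-matching selection for a linear-Gaussian model,
# where candidate x contributes Fisher information x x^T. Greedy selection
# continues until the selected FIM dominates a target FIM (illustrative
# criterion; see the paper for the exact formulation).
import numpy as np

rng = np.random.default_rng(0)
d = 3
candidates = rng.normal(size=(200, d))          # candidate inputs
target_fim = 5.0 * np.eye(d)                    # required information

def min_gap_eig(selected_fim):
    """Smallest eigenvalue of (selected - target); >= 0 means matched."""
    return np.linalg.eigvalsh(selected_fim - target_fim)[0]

selected, fim = [], np.zeros((d, d))
while min_gap_eig(fim) < 0 and len(selected) < len(candidates):
    # Pick the unselected candidate that most improves the worst direction.
    best_i, best_gain = None, -np.inf
    for i in range(len(candidates)):
        if i in selected:
            continue
        gain = min_gap_eig(fim + np.outer(candidates[i], candidates[i]))
        if gain > best_gain:
            best_i, best_gain = i, gain
    selected.append(best_i)
    fim += np.outer(candidates[best_i], candidates[best_i])

print(f"Matched target information with {len(selected)} of "
      f"{len(candidates)} candidate measurements.")
```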
arXiv Detail & Related papers (2024-11-05T02:16:23Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
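As a rough illustration of the idea (not the paper's actual framework), one can score each (image, label) pair by CLIP image-text similarity and drop the least-aligned samples as likely noisy. This assumes the Hugging Face transformers library and the CLIP checkpoint named below are available.

```python
# Minimal sketch of CLIP-based sample scoring: rank each (image, label)
# pair by image-text similarity and keep the best-aligned samples.
# This illustrates the general idea only, not the paper's framework.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_scores(image_paths, labels):
    """Cosine similarity between each image and its own label prompt."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    prompts = [f"a photo of a {lab}" for lab in labels]
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)   # similarity of each image to its label

# Usage sketch: drop the 20% of samples whose labels CLIP agrees with least,
# on the assumption that they are noisy or unrepresentative.
# scores = alignment_scores(paths, labels)
# keep = scores.argsort(descending=True)[: int(0.8 * len(scores))]
```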
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Building Manufacturing Deep Learning Models with Minimal and Imbalanced Training Data Using Domain Adaptation and Data Augmentation [15.333573151694576]
We propose a novel domain adaptation (DA) approach to address the problem of labeled training data scarcity for a target learning task.
Our approach works for scenarios where the source dataset and the dataset available for the target learning task have the same or different feature spaces.
We evaluate our combined approach using image data for wafer defect prediction.
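The summary gives no implementation detail, so the sketch below shows only the simplest transfer-learning form of the idea: reuse a backbone trained on a data-rich source domain and fine-tune a new head on the scarce, augmented target data. The backbone choice and hyperparameters are assumptions.

```python
# Generic transfer-learning sketch (one simple form of domain adaptation):
# reuse a backbone trained on a data-rich source domain, then fine-tune a
# new head on the scarce target task. Illustrative only; the paper's DA
# method and any feature-space mapping are not shown here.
import torch
import torch.nn as nn
from torchvision import models, transforms

# Backbone pretrained on the source domain (ImageNet here as a stand-in).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                           # freeze source features
backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # defect / no defect

# Augmentation stretches the small target set further.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# A training loop over the small, augmented target dataset would go here.
```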
arXiv Detail & Related papers (2023-05-31T21:45:34Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
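A minimal sketch of the structure-to-text pattern follows, assuming a hypothetical call_llm stand-in for whatever chat-completion client is available; the prompts and the self-check step are illustrative, not STAR's actual instructions.

```python
# Sketch of LLM-driven data synthesis in the spirit of structure-to-text
# generation: start from a target structure, then prompt step by step for a
# passage that realizes it. `call_llm` is a hypothetical stand-in for any
# chat-completion API; the prompts are illustrative, not STAR's actual ones.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def synthesize_instance(relation: str, head: str, tail: str) -> dict:
    # Step 1: draft a passage that expresses the target structure.
    passage = call_llm(
        f"Write a short news-style paragraph in which the relation "
        f"'{relation}' holds between '{head}' and '{tail}'."
    )
    # Step 2: self-check that the structure is actually recoverable.
    verdict = call_llm(
        f"Passage: {passage}\nDoes this passage state that '{head}' "
        f"has relation '{relation}' to '{tail}'? Answer yes or no."
    )
    return {"text": passage, "triple": (head, relation, tail),
            "verified": verdict.strip().lower().startswith("yes")}

# Generated instances that fail the check can be regenerated or dropped
# before being added to the low-resource training set.
```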
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm that models the data collection workflow as a formal optimal data collection problem.
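A toy version of that framing, under an assumed learning curve and made-up costs, picks the collection amount by trading per-sample cost against the penalty of a later re-collection; everything below is illustrative.

```python
# Toy version of data collection as an optimization problem: choose the
# amount to collect now by trading per-sample cost against a penalty for
# failing to meet a performance requirement. The learning-curve model and
# all constants are assumptions for illustration.
import numpy as np

def accuracy_model(n):
    return 0.95 - 4.0 * n ** -0.5        # assumed learning curve

COST_PER_SAMPLE = 0.01
FAILURE_PENALTY = 5000.0                 # cost of a later re-collection
REQUIREMENT = 0.90

def total_cost(n):
    meets = accuracy_model(n) >= REQUIREMENT
    return n * COST_PER_SAMPLE + (0.0 if meets else FAILURE_PENALTY)

candidates = np.arange(1000, 200001, 1000)
best_n = min(candidates, key=total_cost)
print(f"Collect about {best_n:,} samples "
      f"(predicted accuracy {accuracy_model(best_n):.3f}).")
```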
arXiv Detail & Related papers (2022-10-03T21:19:05Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Training Data Augmentation for Deep Learning Radio Frequency Systems [1.1199585259018459]
This work focuses on the data used during training.
In general, each of the examined data types contributes usefully to a final application.
Despite the benefit of captured data, the difficulties and costs that arise from live collection often make the quantity of data needed to achieve peak performance impractical.
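Typical augmentations for complex-baseband (IQ) captures include random carrier-phase rotation, circular time shifts, and additive white Gaussian noise at a chosen SNR. The sketch below is a generic illustration with assumed parameters, not the paper's exact augmentation set.

```python
# Sketch of common augmentations for complex-baseband (IQ) training data:
# random carrier-phase rotation, circular time shift, and additive white
# Gaussian noise at a chosen SNR. Parameter choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def augment_iq(iq: np.ndarray, snr_db: float) -> np.ndarray:
    """Return an augmented copy of a 1-D complex64 IQ capture."""
    # Random carrier-phase offset; valid only when the label semantics
    # are invariant to rotation, as in most modulation-class setups.
    out = iq * np.exp(1j * rng.uniform(0, 2 * np.pi))
    # Random circular shift stands in for unknown symbol timing.
    out = np.roll(out, rng.integers(0, out.size))
    # AWGN scaled to hit the requested SNR relative to signal power.
    sig_power = np.mean(np.abs(out) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(out.size) + 1j * rng.standard_normal(out.size))
    return (out + noise).astype(np.complex64)

# Usage sketch: widen a small captured dataset with noisy variants.
# augmented = [augment_iq(x, snr_db=10.0) for x in captures for _ in range(4)]
```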
arXiv Detail & Related papers (2020-10-01T02:26:16Z)
- Overcoming Noisy and Irrelevant Data in Federated Learning [13.963024590508038]
Federated learning is an effective way of training a machine learning model in a distributed manner from local data collected by client devices.
We propose a method for selecting relevant data in a distributed manner, using a benchmark model trained on a small benchmark dataset.
The effectiveness of our proposed approach is evaluated on multiple real-world image datasets in a simulated system with a large number of clients.
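A minimal sketch of the client-side filtering idea follows, assuming a scikit-learn-style benchmark model and an illustrative keep-fraction threshold rather than the paper's actual selection rule.

```python
# Sketch of benchmark-model data filtering on a client: score each local
# sample with a model trained on a small trusted benchmark dataset and keep
# only samples whose loss falls below a threshold. The threshold rule is
# illustrative; see the paper for the actual selection criterion.
import numpy as np

def filter_client_data(benchmark_model, X, y, keep_fraction=0.8):
    """Keep the keep_fraction of local samples the benchmark model finds
    most plausible (lowest cross-entropy under its predictions)."""
    probs = benchmark_model.predict_proba(X)          # scikit-learn style
    losses = -np.log(probs[np.arange(len(y)), y] + 1e-12)
    cutoff = np.quantile(losses, keep_fraction)
    mask = losses <= cutoff
    return X[mask], y[mask]

# Each client would run this locally before federated training, so noisy
# or irrelevant samples never enter the aggregated model updates.
```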
arXiv Detail & Related papers (2020-01-22T22:28:47Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related-domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
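One way to sketch the generator objective behind this kind of data-free retrieval: reward outputs that the frozen classifier finds confidently class-like while keeping each batch diverse across classes. The loss terms and weight below are illustrative, not DeGAN's exact formulation.

```python
# Sketch of a generator objective for data-free sample retrieval: push a
# GAN's outputs toward images the frozen classifier finds confidently
# class-like while keeping the batch diverse across classes. Loss weights
# and terms are illustrative, not DeGAN's exact formulation.
import torch
import torch.nn.functional as F

def generator_loss(classifier, fake_images, w_div=1.0):
    logits = classifier(fake_images)          # frozen, trained classifier
    probs = F.softmax(logits, dim=1)
    # Confidence: each generated sample should look like *some* class.
    sample_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(1).mean()
    # Diversity: the batch as a whole should cover many classes.
    mean_probs = probs.mean(dim=0)
    batch_entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum()
    return sample_entropy - w_div * batch_entropy

# Minimizing this over the generator (alongside the usual adversarial loss
# on the related-domain data) steers generated samples toward ones that are
# representative of the classifier's training distribution.
```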
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.