IoT Data Trust Evaluation via Machine Learning
- URL: http://arxiv.org/abs/2308.11638v1
- Date: Tue, 15 Aug 2023 05:44:01 GMT
- Title: IoT Data Trust Evaluation via Machine Learning
- Authors: Timothy Tadj, Reza Arablouei, Volkan Dedeoglu
- Abstract summary: We propose a data synthesis method, called random walk infilling (RWI), to augment IoT time-series datasets by synthesizing untrustworthy data.
We also extract new features from IoT time-series sensor data that effectively capture its auto-correlation.
These features can be used to learn ML models for recognizing the trustworthiness of IoT sensor data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various approaches based on supervised or unsupervised machine learning (ML)
have been proposed for evaluating IoT data trust. However, assessing their
real-world efficacy is hard mainly due to the lack of related
publicly-available datasets that can be used for benchmarking. Since obtaining
such datasets is challenging, we propose a data synthesis method, called random
walk infilling (RWI), to augment IoT time-series datasets by synthesizing
untrustworthy data from existing trustworthy data. Thus, RWI enables us to
create labeled datasets that can be used to develop and validate ML models for
IoT data trust evaluation. We also extract new features from IoT time-series
sensor data that effectively capture its auto-correlation as well as its
cross-correlation with the data of the neighboring (peer) sensors. These
features can be used to learn ML models for recognizing the trustworthiness of
IoT sensor data. Equipped with our synthesized ground-truth-labeled datasets
and informative correlation-based features, we conduct extensive experiments to
critically examine various approaches to evaluating IoT data trust via ML. The
results reveal that commonly used ML-based approaches to IoT data trust
evaluation, which rely on unsupervised cluster analysis to assign trust labels
to unlabeled data, perform poorly. This poor performance can be attributed to
the underlying unsubstantiated assumption that clustering provides reliable
labels for data trust, a premise that is found to be untenable. The results
also show that the ML models learned from datasets augmented via RWI while
using the proposed features generalize well to unseen data and outperform
existing related approaches. Moreover, we observe that a semi-supervised ML
approach that requires only about 10% of the data labeled offers competitive
performance while being practically more appealing compared to the
fully-supervised approaches.
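To make the two ideas above concrete, here is a minimal sketch assuming a univariate sensor series stored as a NumPy array: `rwi_augment` infills a segment with a random walk bridged back to the original series (one plausible reading of RWI), and `correlation_features` computes auto-correlation at a few lags plus lag-zero cross-correlation with a peer sensor. The function names, the bridging step, and the exact feature set are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical helpers; the paper's exact RWI procedure and feature set may differ.
import numpy as np

def rwi_augment(series, start, length, step_std=0.5, seed=None):
    """Synthesize an untrustworthy segment by infilling series[start:start+length]
    with a random walk bridged back to the value at the segment's end."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    walk = series[start] + np.cumsum(rng.normal(0.0, step_std, size=length))
    # Linear correction so the infill reconnects with the original series.
    walk += np.linspace(0.0, series[start + length] - walk[-1], length)
    out[start:start + length] = walk
    return out

def correlation_features(window, peer_window, max_lag=3):
    """Auto-correlation at lags 1..max_lag, plus lag-zero cross-correlation
    with a neighboring (peer) sensor's window."""
    w = (window - window.mean()) / (window.std() + 1e-12)
    p = (peer_window - peer_window.mean()) / (peer_window.std() + 1e-12)
    acf = [float(np.mean(w[:-k] * w[k:])) for k in range(1, max_lag + 1)]
    return np.array(acf + [float(np.mean(w * p))])

# Trustworthy signal, a correlated peer, and an RWI-corrupted copy.
t = np.linspace(0.0, 10.0, 500)
trusted = np.sin(t) + 0.05 * np.random.default_rng(0).normal(size=t.size)
peer = np.sin(t + 0.1)
untrusted = rwi_augment(trusted, start=200, length=100, seed=1)
print(correlation_features(trusted[200:300], peer[200:300]))    # high correlations
print(correlation_features(untrusted[200:300], peer[200:300]))  # typically degraded
```

Feature vectors like these, computed per window and labeled trustworthy or untrustworthy via RWI, can then be used to train any standard classifier.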
Related papers
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
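A hedged sketch of this idea: encode each subset as a binary membership vector, evaluate utility by actual retraining for a few subsets, and fit a GP to predict the rest. The membership encoding, kernel, and toy utility function below are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch only: subset encoding, kernel, and utility are assumptions.
import itertools
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

n_points = 8  # tiny dataset so all 2**8 = 256 subsets can be enumerated
all_subsets = np.array(list(itertools.product([0, 1], repeat=n_points)), dtype=float)

def evaluated_utility(mask):
    # Stand-in for "retrain the ML model on this subset and measure utility".
    return mask.sum() / n_points - 0.1 * mask[0] * mask[1]

rng = np.random.default_rng(0)
train_idx = rng.choice(len(all_subsets), size=40, replace=False)
X_train = all_subsets[train_idx]
y_train = np.array([evaluated_utility(m) for m in X_train])

# Fit a GP on the 40 evaluated subsets, then predict all remaining utilities
# (with uncertainty) without any further model retraining.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=1e-4)
gp.fit(X_train, y_train)
mean, std = gp.predict(all_subsets, return_std=True)
print(mean[:5], std[:5])
```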
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
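As a toy illustration of this reformulation (not the paper's models), one can treat entire samples as inputs and a distribution-level quantity as the label, e.g., learning to estimate a distribution's standard deviation from 50 i.i.d. draws:

```python
# Toy setup: estimate a distribution's standard deviation from raw samples.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_tasks, n_draws = 2000, 50
sigmas = rng.uniform(0.5, 3.0, size=n_tasks)  # per-dataset ground-truth statistic
samples = rng.normal(0.0, sigmas[:, None], size=(n_tasks, n_draws))
X = np.sort(samples, axis=1)  # sorting gives a permutation-invariant encoding

model = GradientBoostingRegressor().fit(X[:1500], sigmas[:1500])
pred = model.predict(X[1500:])
print("mean absolute error:", np.mean(np.abs(pred - sigmas[1500:])))
```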
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.
We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z)
- Demystifying Spectral Bias on Real-World Data [2.3020018305241337]
Kernel ridge regression (KRR) and Gaussian processes (GPs) are fundamental tools in statistics and machine learning.
We consider cross-dataset learnability and show that one may use eigenvalues and eigenfunctions associated with highly idealized data measures to reveal spectral bias on complex datasets.
arXiv Detail & Related papers (2024-06-04T18:00:00Z)
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
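A minimal sketch of PMI-based valuation under a simple conjugate Gaussian model (an illustrative assumption, not the paper's setting): PMI(test, evaluated) = log p(test, evaluated) - log p(test) - log p(evaluated), computed from marginal likelihoods.

```python
# Illustrative conjugate Gaussian model: x_i = theta + noise, theta ~ N(0, 1).
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal(x, prior_var=1.0, noise_var=1.0):
    """log p(x) after integrating out the shared latent mean theta."""
    n = len(x)
    cov = noise_var * np.eye(n) + prior_var * np.ones((n, n))
    return multivariate_normal.logpdf(x, mean=np.zeros(n), cov=cov)

def pmi(test, evaluated):
    joint = log_marginal(np.concatenate([evaluated, test]))
    return joint - log_marginal(evaluated) - log_marginal(test)

rng = np.random.default_rng(0)
theta = 1.5
test = theta + rng.normal(size=20)
honest = theta + rng.normal(size=20)  # truthfully reported observations
fake = rng.normal(size=20)            # fabricated, unrelated data
print(pmi(test, honest))  # typically higher: honest data is informative about test
print(pmi(test, fake))    # typically lower
```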
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Self-Supervised Learning for User Localization [8.529237718266042]
Machine learning techniques have shown remarkable accuracy in localization tasks.
Their dependency on vast amounts of labeled data, particularly Channel State Information (CSI) and corresponding coordinates, remains a bottleneck.
We propose a pioneering approach that leverages self-supervised pretraining on unlabeled data to boost the performance of supervised learning for user localization based on CSI.
arXiv Detail & Related papers (2024-04-19T21:49:10Z)
- FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN [1.5749416770494706]
Federated Learning (FL) provides a privacy-preserving mechanism for distributed training of machine learning models on networked devices.
We propose FLIGAN, a novel approach to address the issue of data incompleteness in FL.
Our methodology adheres to FL's privacy requirements by generating synthetic data in a federated manner without sharing the actual data in the process.
arXiv Detail & Related papers (2024-03-25T16:49:38Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- Data Collaboration Analysis applied to Compound Datasets and the Introduction of Projection data to Non-IID settings [6.037276428689637]
Federated learning has been applied to compound datasets to increase their prediction accuracy while safeguarding potentially proprietary information.
We propose an alternative distributed machine learning method for chemical compound data from open sources, called data collaboration analysis (DCPd).
DCPd exhibited a negligible decline in classification accuracy in experiments with different degrees of label bias.
arXiv Detail & Related papers (2023-08-01T04:37:08Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Promises and Pitfalls of Threshold-based Auto-labeling [17.349289155257715]
We study threshold-based auto-labeling (TBAL), in which a model's sufficiently confident predictions are used to label data automatically.
We derive complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data.
We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
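A minimal sketch of the TBAL workflow, under assumed dataset and model choices: on human-labeled validation data, pick the lowest confidence threshold whose above-threshold accuracy meets a target, then auto-label only the pool points clearing that threshold.

```python
# Assumed dataset/model; the target accuracy of 0.95 is an illustrative choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_seed, y_seed = X[:200], y[:200]      # small human-labeled training set
X_val, y_val = X[200:700], y[200:700]  # human-labeled validation set
X_pool = X[700:]                       # unlabeled pool to auto-label

clf = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
val_conf = clf.predict_proba(X_val).max(axis=1)
val_correct = clf.predict(X_val) == y_val

# Lowest threshold whose above-threshold validation accuracy meets the target.
target_acc, threshold = 0.95, None
for t in np.sort(val_conf):
    mask = val_conf >= t
    if mask.any() and val_correct[mask].mean() >= target_acc:
        threshold = t
        break

if threshold is None:
    print("no threshold meets the target; fall back to human labeling")
else:
    auto = clf.predict_proba(X_pool).max(axis=1) >= threshold
    print(f"auto-labeled {auto.sum()} of {len(X_pool)} points at t={threshold:.3f}")
```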
arXiv Detail & Related papers (2022-11-22T22:53:17Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not hold in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)