Ocean Data Quality Assessment through Outlier Detection-enhanced Active
Learning
- URL: http://arxiv.org/abs/2312.10817v1
- Date: Sun, 17 Dec 2023 20:57:22 GMT
- Title: Ocean Data Quality Assessment through Outlier Detection-enhanced Active
Learning
- Authors: Na Li, Yiyang Qi, Ruyue Xin, Zhiming Zhao
- Abstract summary: The Argo network, dedicated to ocean profiling, generates a vast volume of observatory data.
Existing methods, including machine learning, fall short due to limited labeled data and imbalanced datasets.
We propose an ODEAL framework for ocean data quality assessment, employing active learning (AL) to reduce human experts' workload.
- Score: 4.274369283265131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ocean and climate research benefits from global ocean observation initiatives
such as Argo, GLOSS, and EMSO. The Argo network, dedicated to ocean profiling,
generates a vast volume of observatory data. However, data quality issues from
sensor malfunctions and transmission errors necessitate stringent quality
assessment. Existing methods, including machine learning, fall short due to
limited labeled data and imbalanced datasets. To address these challenges, we
propose an ODEAL (Outlier Detection-enhanced Active Learning) framework for
ocean data quality assessment, employing AL to
reduce human experts' workload in the quality assessment workflow and
leveraging outlier detection algorithms for effective model initialization. We
also conduct extensive experiments on five large-scale realistic Argo datasets
to gain insights into our proposed method, including the effectiveness of AL
query strategies and the initial set construction approach. The results suggest
that our framework enhances quality assessment efficiency by up to 465.5% with
the uncertainty-based query strategy compared to random sampling and minimizes
overall annotation costs by up to 76.9% using the initial set built with
outlier detectors.
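To make the approach concrete, the following is a minimal, self-contained sketch of the pattern the abstract describes: seed the labeled pool with an outlier detector, then query an expert via uncertainty sampling. It uses scikit-learn on synthetic data; the dataset, the choice of IsolationForest and RandomForestClassifier, the pool sizes, and the number of rounds are illustrative assumptions, not the authors' ODEAL implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for Argo profile features: mostly good measurements
# plus a small fraction of bad-quality ones (the imbalanced setting).
X = np.vstack([rng.normal(0, 1, (2000, 8)), rng.normal(4, 2, (60, 8))])
y = np.r_[np.zeros(2000, int), np.ones(60, int)]   # hidden oracle labels

# Initial set construction with an outlier detector: seed the labeled pool
# with the most anomalous points plus some clear inliers, so the first
# classifier already sees both classes despite the imbalance.
scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
order = np.argsort(scores)
labeled = set(order[-20:]) | set(order[:20])

# Active learning loop with an uncertainty-based query strategy.
for _ in range(10):
    idx = np.fromiter(labeled, int)
    clf = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])
    p = clf.predict_proba(X)[:, 1]
    uncertainty = 1.0 - 2.0 * np.abs(p - 0.5)   # peaks where p ~= 0.5
    uncertainty[idx] = -np.inf                  # never re-query labeled points
    labeled.add(int(np.argmax(uncertainty)))    # the "expert" supplies y here

print(f"labeled {len(labeled)} of {len(X)} samples")
```

In this pattern, the outlier detector does the work of the initial set construction, and each loop iteration spends one expert annotation where the current model is least certain.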
Related papers
- How Reliable Is Human Feedback For Aligning Large Language Models? [24.66495636695214]
We conduct a comprehensive study and provide an in-depth analysis of human feedback data.
We identify six key sources of unreliability, such as mislabeling, subjective preferences, and differing criteria and thresholds for helpfulness and harmlessness.
We propose Source-Aware Cleaning, an automatic data-cleaning method guided by the insights of our qualitative analysis, to significantly improve data quality.
arXiv Detail & Related papers (2024-10-02T19:03:42Z)
- Assessment of Spectral based Solutions for the Detection of Floating Marine Debris [2.3558144417896587]
Recently, the Marine Debris Archive (MARIDA) has been released as a standard dataset for developing and evaluating Machine Learning (ML) algorithms for the detection of Marine Plastic Debris.
In this work, an assessment of spectral-based solutions is proposed by evaluating their performance on the MARIDA dataset.
arXiv Detail & Related papers (2024-08-19T17:47:22Z)
- Quanv4EO: Empowering Earth Observation by means of Quanvolutional Neural Networks [62.12107686529827]
This article highlights a significant shift towards leveraging quantum computing techniques in processing large volumes of remote sensing data.
The proposed Quanv4EO model introduces a quanvolution method for preprocessing multi-dimensional EO data.
Key findings suggest that the proposed model not only maintains high precision in image classification but also shows improvements of around 5% in EO use cases.
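For readers unfamiliar with the term, a quanvolution slides a small quantum circuit over image patches the way a convolution slides a kernel. The toy NumPy simulation below sketches that mechanism under simplifying assumptions (2x2 patches, a fixed Haar-random 4-qubit unitary standing in for a trained circuit, per-qubit Z expectations as output channels); it illustrates the general technique only and is not the Quanv4EO model.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_unitary(dim, rng):
    # Haar-random unitary via QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q @ np.diag(d / np.abs(d))

U = random_unitary(16, rng)  # fixed random "circuit" on 4 qubits

def quanv_patch(patch):
    """Encode a 2x2 patch into 4 qubits, apply U, read out <Z> per qubit."""
    angles = np.pi * patch.ravel()    # pixel in [0, 1] -> RY rotation angle
    state = np.array([1.0 + 0j])      # build RY(a0)|0> (x) ... (x) RY(a3)|0>
    for a in angles:
        state = np.kron(state, [np.cos(a / 2), np.sin(a / 2)])
    probs = np.abs(U @ state) ** 2
    bits = (np.arange(16)[:, None] >> np.arange(3, -1, -1)) & 1
    return ((1 - 2 * bits) * probs[:, None]).sum(axis=0)  # <Z_i> per qubit

def quanvolve(img):
    """Slide the quantum patch kernel over a grayscale image, stride 2."""
    h, w = img.shape
    out = np.zeros((h // 2, w // 2, 4))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            out[i // 2, j // 2] = quanv_patch(img[i:i + 2, j:j + 2])
    return out

features = quanvolve(rng.random((8, 8)))   # toy 8x8 "EO" tile
print(features.shape)                      # (4, 4, 4): four channels per patch
```

The four expectation values per patch play the role of feature-map channels, which a classical network can then consume; this mirrors how quanvolutional preprocessing is typically combined with conventional layers.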
arXiv Detail & Related papers (2024-07-24T09:11:34Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial process for generating in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Human annotators judge our DACO-RL algorithm to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to enable dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose InfoRM, a reward modeling framework that introduces a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
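As a rough illustration of what a variational information bottleneck objective for reward modeling can look like, here is a short PyTorch sketch: the reward is predicted from a stochastic latent z, and a KL penalty discourages z from carrying information beyond what reward prediction needs. The embedding size, bottleneck width, beta, and the Bradley-Terry preference loss are illustrative assumptions, not InfoRM's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardModel(nn.Module):
    """Reward head with a variational information bottleneck on its latent."""
    def __init__(self, in_dim=768, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * z_dim)  # outputs (mu, log_var)
        self.head = nn.Linear(z_dim, 1)          # reward read off the bottleneck

    def forward(self, h):
        mu, log_var = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1 - log_var).sum(-1)
        return self.head(z).squeeze(-1), kl

def ib_preference_loss(model, h_chosen, h_rejected, beta=0.1):
    # Bradley-Terry preference loss plus a beta-weighted KL term that
    # squeezes reward-irrelevant information out of the latent.
    r_c, kl_c = model(h_chosen)
    r_r, kl_r = model(h_rejected)
    return -F.logsigmoid(r_c - r_r).mean() + beta * (kl_c + kl_r).mean()

# Toy usage with random stand-ins for response embeddings.
model = IBRewardModel()
loss = ib_preference_loss(model, torch.randn(4, 768), torch.randn(4, 768))
loss.backward()
```

Intuitively, the information that survives the bottleneck is the reward-relevant signal, which is one reason such a latent can double as a handle for detecting overoptimization.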
arXiv Detail & Related papers (2024-02-14T17:49:07Z)
- Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection [55.2480439325792]
We present an in-depth evaluation of an object detection model that integrates the LSKNet backbone with the DiffusionDet head.
The proposed model achieves a mean average precision (mAP) of approximately 45.7%, a significant improvement.
This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis.
arXiv Detail & Related papers (2023-11-21T19:49:13Z)
- Quality Assurance of A GPT-based Sentiment Analysis System: Adversarial Review Data Generation and Detection [10.567108680774782]
A GPT-based sentiment analysis model is first constructed and studied as the reference in AI quality analysis.
Quality analysis related to data adequacy is then implemented, including a content-based approach for generating plausible adversarial review comments.
Experiments were conducted on Amazon.com review data with a fine-tuned GPT model.
arXiv Detail & Related papers (2023-10-09T00:01:05Z)
- One-Shot Learning for Periocular Recognition: Exploring the Effect of Domain Adaptation and Data Bias on Deep Representations [59.17685450892182]
We investigate the behavior of deep representations in widely used CNN models under extreme data scarcity for One-Shot periocular recognition.
We improve on state-of-the-art results obtained with networks trained on biometric datasets containing millions of images.
Traditional algorithms like SIFT can outperform CNNs in situations with limited data.
arXiv Detail & Related papers (2023-07-11T09:10:16Z)
- Quality In / Quality Out: Assessing Data quality in an Anomaly Detection Benchmark [0.13764085113103217]
We show that relatively minor modifications to the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have a significantly larger impact on model performance than the specific Machine Learning technique considered.
Our findings illustrate the need to devote more attention to (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)