Related papers: Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining

Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining

URL: http://arxiv.org/abs/2504.04032v1
Date: Sat, 05 Apr 2025 02:55:44 GMT
Title: Contrastive and Variational Approaches in Self-Supervised Learning for Complex Data Mining
Authors: Yingbin Liang, Lu Dai, Shuo Shi, Minghao Dai, Junliang Du, Haige Wang,
Abstract summary: This study analyzed the role of self-supervised learning methods in complex data mining through systematic experiments.<n>Results show that the model has strong adaptability on different data sets, can effectively extract high-quality features from unlabeled data, and improves classification accuracy.
Score: 36.772769830368475
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Complex data mining has wide application value in many fields, especially in the feature extraction and classification tasks of unlabeled data. This paper proposes an algorithm based on self-supervised learning and verifies its effectiveness through experiments. The study found that in terms of the selection of optimizer and learning rate, the combination of AdamW optimizer and 0.002 learning rate performed best in all evaluation indicators, indicating that the adaptive optimization method can improve the performance of the model in complex data mining tasks. In addition, the ablation experiment further analyzed the contribution of each module. The results show that contrastive learning, variational modules, and data augmentation strategies play a key role in the generalization ability and robustness of the model. Through the convergence curve analysis of the loss function, the experiment verifies that the method can converge stably during the training process and effectively avoid serious overfitting. Further experimental results show that the model has strong adaptability on different data sets, can effectively extract high-quality features from unlabeled data, and improves classification accuracy. At the same time, under different data distribution conditions, the method can still maintain high detection accuracy, proving its applicability in complex data environments. This study analyzed the role of self-supervised learning methods in complex data mining through systematic experiments and verified its advantages in improving feature extraction quality, optimizing classification performance, and enhancing model stability

Related papers

Optimizing Active Learning in Vision-Language Models via Parameter-Efficient Uncertainty Calibration [6.7181844004432385]
We introduce a novel parameter-efficient learning methodology that incorporates uncertainty calibration loss within the Active Learning framework.<n>We demonstrate that our solution can match and exceed the performance of complex feature-based sampling techniques.
arXiv Detail & Related papers (2025-07-29T06:08:28Z)
The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks [0.8388858079753069]
This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks.<n>We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R2$) and computational costs (training time, inference time, and memory usage).
arXiv Detail & Related papers (2025-06-09T22:32:51Z)
SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning [30.34323856102674]
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations.<n>Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity.<n>We introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies.
arXiv Detail & Related papers (2025-05-28T17:45:05Z)
Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks [3.0800525961862992]
Missing responses due to skips or incomplete attempts create data sparsity.<n>We propose a generative imputation approach using Generative Adrial Imputation Networks (GAIN)<n>Our method features a three-dimensional (3D) framework (learners, questions, and attempts), flexibly accommodating various sparsity levels.
arXiv Detail & Related papers (2025-03-23T06:11:53Z)
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection.<n>By leveraging influence scores, we effectively identify the most impactful data for system improvement.<n>We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z)
Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data [35.431340001608476]
Traditional data mining methods are inadequate when faced with large-scale, high-dimensional and complex data. This study introduces semi-supervised learning methods, aiming to improve the algorithm's ability to utilize unlabeled data. Specifically, we adopt a self-training method and combine it with a convolutional neural network (CNN) for image feature extraction and classification.
arXiv Detail & Related papers (2024-11-27T18:59:50Z)
Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI [17.242331892899543]
Learning performance data describe correct and incorrect answers or problem-solving attempts in adaptive learning. Learning performance data tend to be highly sparse (80%(sim)90% missing observations) in most real-world applications due to adaptive item selection. This article proposes a systematic framework for augmenting learner data to address data sparsity in learning performance data.
arXiv Detail & Related papers (2024-09-24T00:25:07Z)
Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models.
arXiv Detail & Related papers (2024-05-06T21:34:46Z)
Exploring Learning Complexity for Efficient Downstream Dataset Pruning [8.990878450631596]
Existing dataset pruning methods require training on the entire dataset. We propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC) Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters.
arXiv Detail & Related papers (2024-02-08T02:29:33Z)
TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score. We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data. We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
Feature Weaken: Vicinal Data Augmentation for Classification [1.7013938542585925]
We use Feature Weaken to construct the vicinal data distribution with the same cosine similarity for model training. This work can not only improve the classification performance and generalization of the model, but also stabilize the model training and accelerate the model convergence.
arXiv Detail & Related papers (2022-11-20T11:00:23Z)
Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees [1.5605040219256345]
We propose a method based on metrics from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We show results on detecting noisy labels in order clean datasets, improving models' metrics in synthetic and real public datasets, as well as on a industry case in which we deployed a model based on the proposed solution.
arXiv Detail & Related papers (2022-10-20T15:02:49Z)
Meta-learning One-class Classifiers with Eigenvalue Solvers for Supervised Anomaly Detection [55.888835686183995]
We propose a neural network-based meta-learning method for supervised anomaly detection. We experimentally demonstrate that the proposed method achieves better performance than existing anomaly detection and few-shot learning methods.
arXiv Detail & Related papers (2021-03-01T01:43:04Z)
Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management. We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.