Towards Better Modeling with Missing Data: A Contrastive Learning-based
Visual Analytics Perspective
- URL: http://arxiv.org/abs/2309.09744v1
- Date: Mon, 18 Sep 2023 13:16:24 GMT
- Title: Towards Better Modeling with Missing Data: A Contrastive Learning-based
Visual Analytics Perspective
- Authors: Laixin Xie, Yang Ouyang, Longfei Chen, Ziming Wu, Quan Li
- Abstract summary: Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
- Score: 7.577040836988683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Missing data can pose a challenge for machine learning (ML) modeling. To
address this, current approaches are categorized into feature imputation and
label prediction and are primarily focused on handling missing data to enhance
ML performance. These approaches rely on the observed data to estimate the
missing values and therefore encounter three main shortcomings in imputation,
including the need for different imputation methods for various missing data
mechanisms, heavy dependence on the assumption of data distribution, and
potential introduction of bias. This study proposes a Contrastive Learning (CL)
framework to model observed data with missing values, where the ML model learns
the similarity between an incomplete sample and its complete counterpart and
the dissimilarity between other samples. Our proposed approach demonstrates the
advantages of CL without requiring any imputation. To enhance interpretability,
we introduce CIVis, a visual analytics system that incorporates interpretable
techniques to visualize the learning process and diagnose the model status.
Users can leverage their domain knowledge through interactive sampling to
identify negative and positive pairs in CL. The output of CIVis is an optimized
model that takes specified features and predicts downstream tasks. We provide
two usage scenarios in regression and classification tasks and conduct
quantitative experiments, expert interviews, and a qualitative user study to
demonstrate the effectiveness of our approach. In short, this study offers a
valuable contribution to addressing the challenges associated with ML modeling
in the presence of missing data by providing a practical solution that achieves
high predictive accuracy and model interpretability.
Related papers
- The Common Stability Mechanism behind most Self-Supervised Learning
Approaches [64.40701218561921]
We provide a framework to explain the stability mechanism of different self-supervised learning techniques.
We discuss the working mechanism of contrastive techniques like SimCLR, non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO.
We formulate different hypotheses and test them using the Imagenet100 dataset.
arXiv Detail & Related papers (2024-02-22T20:36:24Z) - Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL)
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, textbfTopK + ConE, based on the assumption that textitthe performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z) - AttributionScanner: A Visual Analytics System for Model Validation with Metadata-Free Slice Finding [29.07617945233152]
Data slice finding is an emerging technique for validating machine learning (ML) models by identifying and analyzing subgroups in a dataset that exhibit poor performance.
This approach faces significant challenges, including the laborious and costly requirement for additional metadata.
We introduce AttributionScanner, an innovative human-in-the-loop Visual Analytics (VA) system, designed for metadata-free data slice finding.
Our system identifies interpretable data slices that involve common model behaviors and visualizes these patterns through an Attribution Mosaic design.
arXiv Detail & Related papers (2024-01-12T09:17:32Z) - Assessing Privacy Risks in Language Models: A Case Study on
Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z) - Think Twice: Measuring the Efficiency of Eliminating Prediction
Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring a scale of models' reliance on any identified spurious feature.
We assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA)
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods can not be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z) - A prediction and behavioural analysis of machine learning methods for
modelling travel mode choice [0.26249027950824505]
We conduct a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice.
Results indicate that the models with the highest disaggregate predictive performance provide poorer estimates of behavioural indicators and aggregate mode shares.
It is also observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.
arXiv Detail & Related papers (2023-01-11T11:10:32Z) - Discriminative Multimodal Learning via Conditional Priors in Generative
Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training.
We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $rho$-gap.
We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z) - Data and Model Dependencies of Membership Inference Attack [13.951470844348899]
We provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA.
Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use.
We propose using those data and model properties as regularizers to protect ML models against MIA.
arXiv Detail & Related papers (2020-02-17T09:35:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.