DiscoVars: A New Data Analysis Perspective -- Application in Variable
Selection for Clustering
- URL: http://arxiv.org/abs/2304.03983v1
- Date: Sat, 8 Apr 2023 10:57:19 GMT
- Title: DiscoVars: A New Data Analysis Perspective -- Application in Variable
Selection for Clustering
- Authors: Ayhan Demiriz
- Abstract summary: We present a new data analysis perspective to determine variable importance regardless of the underlying learning task.
We propose a new methodology to select important variables from the data by first creating dependency networks among all variables.
We present our tool as an app built with Shiny, a user-friendly interface development environment.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a new data analysis perspective to determine variable importance
regardless of the underlying learning task. Traditionally, variable selection
is considered an important step in supervised learning for both classification
and regression problems. Variable selection also becomes critical when the
costs associated with data collection and storage are considerably high, as in
cases like remote sensing. We therefore propose a new methodology that selects
important variables from the data by first creating dependency networks among
all variables and then ranking them (i.e. the nodes) by graph centrality
measures. Selecting the top-$n$ variables according to the preferred centrality
measure yields a strong candidate subset of variables for further learning
tasks, e.g. clustering. We present our tool as an app built with Shiny, a
user-friendly interface development environment. We also extend the user
interface to two well-known unsupervised variable selection methods from the
literature for comparison purposes.
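To make the pipeline concrete, here is a minimal sketch in Python rather than the paper's R/Shiny implementation. Absolute pairwise correlation as the dependency criterion and eigenvector centrality as the ranking measure are illustrative assumptions; the paper leaves both the dependency-network construction and the centrality measure as user choices.

```python
# Minimal sketch of the DiscoVars-style idea: build a dependency network
# over variables, rank nodes by a centrality measure, keep the top-n.
# Thresholded absolute correlation is an assumed stand-in for the paper's
# dependency-network construction.
import networkx as nx
import pandas as pd

def select_top_n_variables(df: pd.DataFrame, n: int = 5,
                           threshold: float = 0.3) -> list:
    corr = df.corr().abs()                    # pairwise dependency strengths
    g = nx.Graph()
    g.add_nodes_from(df.columns)
    cols = list(df.columns)
    for i, u in enumerate(cols):
        for v in cols[i + 1:]:
            if corr.loc[u, v] >= threshold:   # keep only strong dependencies
                g.add_edge(u, v, weight=corr.loc[u, v])
    scores = nx.eigenvector_centrality_numpy(g, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Usage: run a downstream task, e.g. clustering, on the selected subset:
# top = select_top_n_variables(pd.read_csv("data.csv"), n=3)
```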
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
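A rough sketch of the two steps named above, assuming per-sample gradient vectors are already available; k-means and the `validate` scorer are hypothetical stand-ins for the paper's actual clustering and selector-evaluation machinery.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_skill_clusters(grad_vecs: np.ndarray, k: int = 8) -> np.ndarray:
    # Group gradient-based sample vectors into k pseudo-skill clusters.
    return KMeans(n_clusters=k, n_init=10).fit_predict(grad_vecs)

def best_selector(samples, selectors, validate):
    # Pick the selector expert whose selected subset scores best on this
    # cluster; `validate` is a hypothetical scoring callable.
    return max(selectors, key=lambda sel: validate(sel(samples)))
```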
- Contextual Feature Selection with Conditional Stochastic Gates
Conditional Stochastic Gates (c-STG) model the importance of features using stochastic gates whose parameters are predicted from contextual variables.
We show that c-STG can lead to improved feature selection capabilities while enhancing prediction accuracy and interpretability.
arXiv Detail & Related papers (2023-12-21T19:12:59Z)
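A toy PyTorch rendering of that idea: a small hypernetwork maps the contextual variables to per-feature gate parameters, and each feature is multiplied by a clipped-Gaussian stochastic gate. The layer sizes, the Gaussian relaxation, and sigma are assumptions for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ConditionalStochasticGates(nn.Module):
    def __init__(self, n_features: int, n_context: int, sigma: float = 0.5):
        super().__init__()
        # Hypernetwork: contextual variables -> per-feature gate means.
        self.hyper = nn.Sequential(nn.Linear(n_context, 32), nn.ReLU(),
                                   nn.Linear(32, n_features))
        self.sigma = sigma

    def forward(self, x: torch.Tensor, context: torch.Tensor):
        mu = self.hyper(context)
        noise = torch.randn_like(mu) * self.sigma if self.training else 0.0
        z = torch.clamp(mu + noise, 0.0, 1.0)  # stochastic gate in [0, 1]
        return x * z, z                        # gated features + gate values
```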
- A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection
Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task.
This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions.
arXiv Detail & Related papers (2023-11-30T17:44:22Z)
- Statistically Valid Variable Importance Assessment through Conditional Permutations
Conditional Permutation Importance (CPI) is a new approach to variable importance assessment.
We show that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control.
Our results suggest that $\textit{CPI}$ can be readily used as a drop-in replacement for permutation-based methods.
arXiv Detail & Related papers (2023-09-14T10:53:36Z)
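The core trick of conditional permutation can be sketched compactly: rather than shuffling feature j outright, shuffle only the residual of j after regressing it on the remaining features, so the permuted column keeps its dependence on them. The linear reconstructor and squared-error loss below are illustrative choices, not necessarily those of the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def conditional_permutation_importance(model, X, y, j, seed=0):
    rng = np.random.default_rng(seed)
    others = np.delete(X, j, axis=1)
    # Part of x_j explained by the other features (illustrative: linear).
    x_hat = LinearRegression().fit(others, X[:, j]).predict(others)
    residual = X[:, j] - x_hat
    X_perm = X.copy()
    X_perm[:, j] = x_hat + rng.permutation(residual)  # shuffle residual only
    base = mean_squared_error(y, model.predict(X))
    perturbed = mean_squared_error(y, model.predict(X_perm))
    return perturbed - base  # loss increase under conditional permutation
```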
- Scalable variable selection for two-view learning tasks with projection operators
We propose a novel variable selection method for two-view settings, or for vector-valued supervised learning problems.
Our framework can handle extremely large-scale selection tasks, where the number of data samples can reach the millions.
arXiv Detail & Related papers (2023-07-04T08:22:05Z)
- Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data
In this paper, we analyse the behaviour of one of the most popular variants of self-supervised learning (SSL) on long-tail data.
We find that a large $\tau$ emphasises group-wise discrimination, whereas a small $\tau$ leads to a higher degree of instance discrimination.
We propose to employ a dynamic $\tau$ and show that a simple cosine schedule can yield significant improvements in the learnt representations.
arXiv Detail & Related papers (2023-03-23T20:37:25Z)
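Such a dynamic temperature can be as simple as the cosine schedule sketched below; the bounds and the single-period annealing are assumptions, not the paper's reported settings.

```python
import math

def cosine_tau(step: int, total_steps: int,
               tau_min: float = 0.1, tau_max: float = 1.0) -> float:
    # Anneal tau from tau_max (group-wise discrimination) down to
    # tau_min (instance discrimination) over training.
    return tau_min + 0.5 * (tau_max - tau_min) * (
        1 + math.cos(math.pi * step / total_steps))
```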
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning
Active learning is a research area that addresses the issue of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
We also propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
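One common way to realize diversity-based selection is greedy k-center over sample embeddings, sketched below; the paper's actual algorithm and its consistency-based embeddings are not reproduced here.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0):
    # Greedy k-center: repeatedly pick the point farthest from the current
    # selection, yielding a diverse initial labeling set.
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(embeddings)))]
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(dists.argmax())
        selected.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected
```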
- A Lagrangian Duality Approach to Active Learning
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- A Two-Stage Variable Selection Approach for Correlated High Dimensional Predictors
We propose a two-stage approach that combines a variable clustering stage with a group variable selection stage for the group variable selection problem.
The variable clustering stage uses information from the data to find a group structure, which improves the performance of existing group variable selection methods.
The two-stage method shows better performance in terms of both prediction accuracy and the accuracy of selecting active predictors.
arXiv Detail & Related papers (2021-03-24T17:28:34Z)
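A hedged sketch of the two stages: cluster correlated predictors by correlation distance, then perform a group-level selection. The Lasso on cluster-mean features is a crude stand-in for a proper group variable selection method such as group Lasso.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LassoCV

def two_stage_select(X: np.ndarray, y: np.ndarray, n_groups: int = 10):
    # Stage 1: cluster predictors using 1 - |correlation| as distance.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1 - np.abs(corr)
    Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
    groups = fcluster(Z, t=n_groups, criterion="maxclust")
    # Stage 2: group-level selection, approximated by a Lasso on
    # cluster-mean features (stand-in for a true group-selection method).
    uniq = np.unique(groups)
    means = np.column_stack([X[:, groups == g].mean(axis=1) for g in uniq])
    lasso = LassoCV(cv=5).fit(means, y)
    active = set(uniq[lasso.coef_ != 0])
    return [j for j in range(X.shape[1]) if groups[j] in active]
```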
- How to distribute data across tasks for meta-learning?
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.