Informative Initialization and Kernel Selection Improves t-SNE for
Biological Sequences
- URL: http://arxiv.org/abs/2211.09263v1
- Date: Wed, 16 Nov 2022 23:36:27 GMT
- Title: Informative Initialization and Kernel Selection Improves t-SNE for
Biological Sequences
- Authors: Prakash Chourasia, Sarwan Ali, Murray Patterson
- Abstract summary: The t-distributed stochastic neighbor embedding (t-SNE) is a method for interpreting high dimensional (HD) data by mapping each point to a low dimensional (LD) space.
We show that kernel selection can play a crucial role in the performance of t-SNE.
- Score: 0.966840768820136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The t-distributed stochastic neighbor embedding (t-SNE) is a method for interpreting high dimensional (HD) data by mapping each point to a low dimensional (LD) space (usually two-dimensional). It seeks to retain the structure of the data. An important component of the t-SNE algorithm is the initialization procedure, which begins with a random initialization of the LD vectors. These initial vectors are then updated iteratively by gradient descent to minimize the loss function (the KL divergence), which causes similar points to attract one another while dissimilar points are pushed apart.
We believe that, by default, these algorithms should employ some form of informative initialization. Another essential component of t-SNE is the kernel matrix, a similarity matrix computed from the pairwise distances among the sequences. For t-SNE-based visualization, the Gaussian kernel is employed by default in the literature. However, we show that kernel selection can also play a crucial role in the performance of t-SNE. In this work, we assess the performance of t-SNE with various alternative initialization methods and kernels, using four datasets, three of which consist of biological sequences (nucleotide, protein, etc.) obtained from various sources, such as the well-known GISAID database of SARS-CoV-2 virus sequences.
We perform subjective and objective assessments of these alternatives, using the resulting t-SNE plots and the k-ary neighborhood agreement (k-ANA) measure to evaluate and compare the proposed methods with the baselines. We show that with techniques such as informative initialization and kernel matrix selection, t-SNE performs significantly better. Moreover, t-SNE converges in fewer iterations with informative initialization.
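As a rough illustration of these two ideas, the sketch below seeds t-SNE with a PCA projection instead of random noise and scores the result with a simple k-ary neighborhood agreement measure. It is a minimal sketch using scikit-learn on a stand-in dataset; the paper's exact datasets, kernels, and hyperparameters are not reproduced here.

```python
# Minimal sketch: PCA-initialized t-SNE plus a k-ary neighborhood
# agreement (k-ANA) score. Illustrative only; not the paper's exact setup.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

X = load_digits().data  # stand-in for a sequence feature matrix

# Informative initialization: seed the LD vectors with the first two
# principal components instead of random noise.
init_2d = PCA(n_components=2).fit_transform(X)
emb = TSNE(n_components=2, init=init_2d, random_state=0).fit_transform(X)

def k_ana(X_hd, X_ld, k=10):
    """Average overlap between each point's k nearest neighbors in the
    HD space and in the LD embedding (1.0 = perfectly preserved)."""
    idx_hd = NearestNeighbors(n_neighbors=k).fit(X_hd).kneighbors(
        return_distance=False)  # self is excluded automatically
    idx_ld = NearestNeighbors(n_neighbors=k).fit(X_ld).kneighbors(
        return_distance=False)
    return np.mean([len(set(a) & set(b)) for a, b in zip(idx_hd, idx_ld)]) / k

print(f"k-ANA at k=10: {k_ana(X, emb):.3f}")
```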
Related papers
- MIK: Modified Isolation Kernel for Biological Sequence Visualization, Classification, and Clustering [3.9146761527401424]
This research proposes a novel approach called the Modified Isolation Kernel (MIK) as an alternative to the Gaussian kernel.
MIK uses adaptive density estimation to capture local structures more accurately and integrates robustness measures.
It exhibits improved preservation of the local and global structure and enables better visualization of clusters and subclusters in the embedded space.
arXiv Detail & Related papers (2024-10-21T06:57:09Z)
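For orientation, the sketch below implements the plain Voronoi-based isolation kernel that MIK modifies: similarity is the fraction of random psi-point partitionings in which two points fall into the same cell. MIK's adaptive density estimation and robustness measures are not included, and psi and t are illustrative choices.

```python
# Plain Voronoi-based isolation kernel (the construction MIK builds on).
import numpy as np

def isolation_kernel(X, psi=16, t=200, seed=0):
    """K[i, j] = fraction of t random partitionings in which points i and
    j share the same nearest sampled center (Voronoi cell)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    K = np.zeros((n, n))
    for _ in range(t):
        centers = X[rng.choice(n, size=psi, replace=False)]
        # Assign every point to its nearest sampled center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        cell = d2.argmin(axis=1)
        K += cell[:, None] == cell[None, :]
    return K / t
```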
- Learning nonparametric DAGs with incremental information via high-order HSIC [13.061477915002767]
We present an identifiability condition based on a determined subset of parents to identify the underlying DAG.
In the optimal phase, an optimization problem based on the first-order Hilbert-Schmidt independence criterion (HSIC) gives an estimated skeleton as the initial determined parents subset.
In the tuning phase, the skeleton is locally tuned by deletion, addition, and DAG-formalization strategies.
arXiv Detail & Related papers (2023-08-11T07:07:21Z)
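As background, a minimal empirical HSIC estimator with Gaussian kernels is shown below; the first-order and high-order variants in the paper build on this quantity. The median-heuristic bandwidth is an assumption for the sketch, not necessarily the authors' choice.

```python
# Biased empirical HSIC estimator: HSIC(X, Y) = tr(K H L H) / (n - 1)^2,
# where H is the centering matrix and K, L are Gram matrices.
import numpy as np

def gaussian_gram(Z):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / np.median(sq[sq > 0]))  # median-heuristic bandwidth

def hsic(X, Y):
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(gaussian_gram(X) @ H @ gaussian_gram(Y) @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = np.tanh(X) + 0.1 * rng.standard_normal((200, 2))
print(f"HSIC(X, Y) = {hsic(X, Y):.4f}")  # larger => stronger dependence
```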
- Kernel t-distributed stochastic neighbor embedding [6.107978190324034]
This paper presents a kernelized version of the t-SNE algorithm.
It is capable of mapping high-dimensional data to a low-dimensional space while preserving the pairwise distances between the data points in a non-Euclidean metric.
arXiv Detail & Related papers (2023-07-13T22:23:05Z)
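One simple way to realize a kernelized t-SNE with off-the-shelf tools is to turn a kernel matrix into distances via d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y) and pass them as a precomputed metric. The Laplacian kernel below is an arbitrary stand-in for the selected kernel, not necessarily the paper's formulation.

```python
# Kernel-induced distances fed to t-SNE via metric='precomputed'.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import laplacian_kernel

X = load_digits().data
K = laplacian_kernel(X)  # stand-in for the chosen kernel
diag = np.diag(K)
D = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2 * K, 0.0))

# scikit-learn requires an explicit random (or array) init with a
# precomputed metric, since PCA init needs the raw features.
emb = TSNE(metric="precomputed", init="random", random_state=0).fit_transform(D)
```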
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- Joint Embedding Self-Supervised Learning in the Kernel Regime [21.80241600638596]
Self-supervised learning (SSL) produces useful representations of data without access to any labels for classifying the data.
We extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel.
We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.
arXiv Detail & Related papers (2022-09-29T15:53:19Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
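For context, the smallest exact baseline in this family is a k-mer spectrum feature map followed by a linear SVM, sketched below; the sequences, labels, and choice of k = 3 are toy placeholders, and the paper's approximate kernel is a different, faster construction.

```python
# Exact k-mer spectrum features plus a linear SVM (toy baseline).
from itertools import product
import numpy as np
from sklearn.svm import LinearSVC

ALPHABET, K = "ACGT", 3
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}

def spectrum(seq):
    """Count occurrences of every length-K substring of the sequence."""
    v = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        v[KMER_INDEX[seq[i:i + K]]] += 1
    return v

seqs = ["ACGTACGT", "TTTTACGT", "ACGTTTTT", "GGGGACGT"]  # toy data
labels = [0, 1, 1, 0]
X = np.stack([spectrum(s) for s in seqs])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```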
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
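The building block being sketched is the arc-cosine kernel family; a plain Monte Carlo random-feature approximation of the degree-1 arc-cosine kernel is shown below. The paper's contribution is doing this in near input-sparsity time, which this sketch does not attempt.

```python
# Monte Carlo random features for the degree-1 arc-cosine kernel:
# phi(x) . phi(y) ~= E_w[2 relu(w.x) relu(w.y)] for w ~ N(0, I).
import numpy as np

def arccos1_features(X, D=2048, seed=0):
    W = np.random.default_rng(seed).standard_normal((X.shape[1], D))
    return np.sqrt(2.0 / D) * np.maximum(X @ W, 0.0)

X = np.random.default_rng(1).standard_normal((5, 8))
Phi = arccos1_features(X)
print(Phi @ Phi.T)  # approximates the exact arc-cosine Gram matrix
```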
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions, while achieving comparable error bounds in both theory and practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Self-supervised Symmetric Nonnegative Matrix Factorization [82.59905231819685]
Symmetric nonnegative matrix factorization (SNMF) has been demonstrated to be a powerful method for data clustering.
Inspired by ensemble clustering, which aims to seek better clustering results, we propose self-supervised SNMF (S$^3$NMF).
We take advantage of the sensitivity of SNMF to initialization, without relying on any additional information.
arXiv Detail & Related papers (2021-03-02T12:47:40Z)
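As a reference point, the sketch below is one standard multiplicative-update solver for the symmetric NMF problem min_{H >= 0} ||A - H H^T||_F^2, the kind of base solver an ensemble scheme like S$^3$NMF can repeatedly restart; the damping constant 0.5 and the iteration count are illustrative assumptions.

```python
# Damped multiplicative updates for symmetric NMF: A ~= H H^T, H >= 0.
import numpy as np

def symnmf(A, r, iters=300, eps=1e-10, seed=0):
    n = A.shape[0]
    H = np.abs(np.random.default_rng(seed).standard_normal((n, r)))
    for _ in range(iters):
        # H <- H * (0.5 + 0.5 * (A H) / (H H^T H)), elementwise.
        H *= 0.5 + 0.5 * (A @ H) / (H @ (H.T @ H) + eps)
    return H
```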
- Optimal Randomized First-Order Methods for Least-Squares Problems [56.05635751529922]
This class of algorithms encompasses several randomized methods among the fastest solvers for least-squares problems.
We focus on two classical embeddings, namely, Gaussian projections and subsampled Hadamard transforms.
Our resulting algorithm yields the best complexity known for solving least-squares problems with no condition number dependence.
arXiv Detail & Related papers (2020-02-21T17:45:32Z)
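A minimal sketch-and-precondition baseline in this family, using the Gaussian embedding mentioned in the abstract: sketch A, take the R factor of the sketched matrix as a preconditioner, and run an iterative solver on A R^{-1}. LSQR here is a stand-in for the paper's optimal first-order methods.

```python
# Sketch-and-precondition least squares with a Gaussian embedding.
import numpy as np
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
m, n = 5000, 50
A = rng.standard_normal((m, n)) * np.logspace(0, 4, n)  # ill-conditioned
b = rng.standard_normal(m)

s = 4 * n                                     # sketch size (a few times n)
S = rng.standard_normal((s, m)) / np.sqrt(s)  # Gaussian projection
_, R = qr(S @ A, mode="economic")             # SA = QR; R preconditions A

AR = solve_triangular(R, A.T, trans="T").T    # forms A @ inv(R)
z = lsqr(AR, b)[0]                            # well-conditioned solve
x = solve_triangular(R, z)                    # recover x = inv(R) z
```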