ADRS-CNet: An adaptive dimensionality reduction selection and classification network for DNA storage clustering algorithms
- URL: http://arxiv.org/abs/2408.12751v2
- Date: Sun, 22 Sep 2024 03:10:02 GMT
- Authors: Bowen Liu, Jiankun Li
- Abstract summary: Methods like PCA, UMAP, and t-SNE are commonly employed to project high-dimensional features into low-dimensional space.
This paper proposes training a multilayer perceptron model to classify input DNA sequence features and adaptively select the most suitable dimensionality reduction method.
- Score: 8.295062627879938
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: DNA storage technology offers new possibilities for massive data storage thanks to its high storage density, long-term preservation, low maintenance cost, and compact size. To improve the reliability of stored information, base errors and missing sequences must be addressed. Currently, sequenced reads are clustered and compared to recover as much of the original sequence information as possible. However, extracting features from DNA sequences of different lengths leads to the curse of dimensionality, which must be overcome. To address this, techniques such as PCA, UMAP, and t-SNE are commonly employed to project high-dimensional features into a low-dimensional space. Because these methods vary in effectiveness across datasets, this paper proposes training a multilayer perceptron to classify input DNA sequence features and adaptively select the most suitable dimensionality reduction method, thereby improving subsequent clustering. Experiments on open-source datasets, compared against several baseline methods, show that the model achieves superior classification performance and significantly improves clustering outcomes, demonstrating that the approach effectively mitigates the impact of the curse of dimensionality on clustering models.
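The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the k-mer featurization, the reducer set (PCA and t-SNE only, via scikit-learn), the dataset-level majority vote, and the use of K-means for the final clustering are all assumptions for the sketch, and the training of the selector (which in the paper learns to pick the method that yields the best downstream clustering) is omitted.

```python
# Hypothetical sketch: featurize variable-length DNA reads as fixed-length
# k-mer count vectors, let a small MLP vote for a dimensionality-reduction
# method, then cluster the reduced features. Illustrative only.
from itertools import product
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 3-mers

def kmer_features(seq: str) -> np.ndarray:
    """Fixed-length 4^3 k-mer count vector for a variable-length read."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 2):
        counts[KMERS.index(seq[i:i + 3])] += 1
    return counts

# Candidate reducers; the paper also considers UMAP (umap-learn), left out
# here to keep the sketch dependency-free beyond scikit-learn.
REDUCERS = {
    0: lambda X: PCA(n_components=2).fit_transform(X),
    1: lambda X: TSNE(n_components=2, perplexity=5,
                      init="random", random_state=0).fit_transform(X),
}

def cluster_adaptively(reads, selector: MLPClassifier, n_clusters: int):
    X = np.stack([kmer_features(s) for s in reads])
    # The selector classifies each feature vector; a majority vote over the
    # dataset chooses which reduction method to apply before clustering.
    method = int(np.bincount(selector.predict(X)).argmax())
    Z = REDUCERS[method](X)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
```

In this sketch the per-sequence predictions are aggregated by majority vote into a single dataset-level choice, so every read is reduced with the same method before clustering.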
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency [10.641875933652647]
We introduce multi-granularity architecture search (MGAS) to discover both effective and efficient neural networks.
We learn discretization functions specific to each granularity level to adaptively determine the unit remaining ratio according to the evolving architecture.
Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods in achieving a better trade-off between model performance and model size.
arXiv Detail & Related papers (2023-10-23T16:32:18Z)
- Implicit Neural Multiple Description for DNA-based data storage [6.423239719448169]
DNA exhibits remarkable potential as a data storage solution due to its impressive storage density and long-term stability.
However, developing this novel medium comes with its own set of challenges, particularly in addressing errors arising from storage and biological manipulations.
We have pioneered a novel compression scheme and a cutting-edge Multiple Description Coding (MDC) technique utilizing neural networks for DNA data storage.
arXiv Detail & Related papers (2023-09-13T13:42:52Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- SparCA: Sparse Compressed Agglomeration for Feature Extraction and Dimensionality Reduction [0.0]
We propose sparse compressed agglomeration (SparCA) as a novel dimensionality reduction procedure.
SparCA is applicable to a wide range of data types, produces highly interpretable features, and shows compelling performance on downstream supervised learning tasks.
arXiv Detail & Related papers (2023-01-26T13:59:15Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity regularized models and apply it on shared-memory and distributed-memory architecture respectively.
We prove that the proposed method achieves the linear convergence rate with lower overall complexity and can eliminate almost all the inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z)
- Efficient Cluster-Based k-Nearest-Neighbor Machine Translation [65.69742565855395]
k-Nearest-Neighbor Machine Translation (kNN-MT) has recently been proposed as a non-parametric solution for domain adaptation in neural machine translation (NMT).
arXiv Detail & Related papers (2022-04-13T05:46:31Z)
- Hybridization of Capsule and LSTM Networks for unsupervised anomaly detection on multivariate data [0.0]
This paper introduces a novel neural network architecture which hybridises Long Short-Term Memory (LSTM) and Capsule Networks into a single network.
The proposed method uses an unsupervised learning technique to overcome the issues with finding large volumes of labelled training data.
arXiv Detail & Related papers (2022-02-11T10:33:53Z)
- Consistency and Diversity induced Human Motion Segmentation [231.36289425663702]
We propose a novel Consistency and Diversity induced human Motion (CDMS) algorithm.
Our model factorizes the source and target data into distinct multi-layer feature spaces.
A multi-mutual learning strategy is carried out to reduce the domain gap between the source and target data.
arXiv Detail & Related papers (2022-02-10T06:23:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.