Vector Embeddings by Sequence Similarity and Context for Improved
Compression, Similarity Search, Clustering, Organization, and Manipulation of
cDNA Libraries
- URL: http://arxiv.org/abs/2308.05118v1
- Date: Tue, 8 Aug 2023 17:31:17 GMT
- Title: Vector Embeddings by Sequence Similarity and Context for Improved
Compression, Similarity Search, Clustering, Organization, and Manipulation of
cDNA Libraries
- Authors: Daniel H. Um, David A. Knowles, Gail E. Kaiser
- Abstract summary: This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ5).
The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves.
- Score: 3.162643581562756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper demonstrates the utility of organized numerical representations of
genes in research involving flat string gene formats (i.e., FASTA/FASTQ5).
FASTA/FASTQ files have several current limitations, such as their large file
sizes, slow processing speeds for mapping and alignment, and contextual
dependencies. These challenges significantly hinder investigations and tasks
that involve finding similar sequences. The solution lies in transforming
sequences into an alternative representation that facilitates easier clustering
into similar groups compared to the raw sequences themselves. By assigning a
unique vector embedding to each short sequence, it is possible to more
efficiently cluster and improve upon compression performance for the string
representations of cDNA libraries. Furthermore, through learning alternative
coordinate vector embeddings based on the contexts of codon triplets, we can
demonstrate clustering based on amino acid properties. Finally, using this
sequence embedding method to encode barcodes and cDNA sequences, we can improve
the time complexity of the similarity search by coupling vector embeddings with
an algorithm that determines the proximity of vectors in Euclidean space; this
allows us to perform sequence similarity searches in a quicker and more modular
fashion.
Related papers
- Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery [6.733319363951907]
textbfDy-mer is an explainable and robust representation scheme based on sparse recovery.
It achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable textbf13% increase in accuracy.
arXiv Detail & Related papers (2024-07-06T15:08:31Z) - Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner.
Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval.
We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance to a query of a document.
arXiv Detail & Related papers (2024-03-31T13:29:43Z) - Linear normalised hash function for clustering gene sequences and
identifying reference sequences from multiple sequence alignments [4.34040512215583]
A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed.
The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences.
arXiv Detail & Related papers (2023-11-29T11:51:05Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
We present an Embed-Search-Align task for aligning DNA reads onto a human reference genome of 3 gigabases (single-haploid)
DNA-ESA is >97% accurate when aligning 250-length reads onto a human reference genome of 3 gigabases (single-haploid)
arXiv Detail & Related papers (2023-09-20T06:30:39Z) - Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For
Hidden Markov Models [70.26374282390401]
Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM based data analyses.
We present Quick Adaptive Ternary (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity.
arXiv Detail & Related papers (2023-05-29T19:37:48Z) - Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z) - Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (NEP) which have been long speculated to perform spectral detection of higher-order occurrences.
arXiv Detail & Related papers (2020-12-28T17:27:18Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z) - Faster Activity and Data Detection in Massive Random Access: A
Multi-armed Bandit Approach [30.292176932361528]
This paper investigates the grant-free random access with massive IoT devices.
By embedding the data symbols in the signature sequences, joint device activity detection and data decoding can be achieved.
arXiv Detail & Related papers (2020-01-28T10:00:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.