Related papers: Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries

URL: http://arxiv.org/abs/2308.05118v1
Date: Tue, 8 Aug 2023 17:31:17 GMT
Title: Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries
Authors: Daniel H. Um, David A. Knowles, Gail E. Kaiser
Abstract summary: This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves.
Score: 3.162643581562756
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). FASTA/FASTQ files have several current limitations, such as their large file sizes, slow processing speeds for mapping and alignment, and contextual dependencies. These challenges significantly hinder investigations and tasks that involve finding similar sequences. The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves. By assigning a unique vector embedding to each short sequence, it is possible to more efficiently cluster and improve upon compression performance for the string representations of cDNA libraries. Furthermore, through learning alternative coordinate vector embeddings based on the contexts of codon triplets, we can demonstrate clustering based on amino acid properties. Finally, using this sequence embedding method to encode barcodes and cDNA sequences, we can improve the time complexity of the similarity search by coupling vector embeddings with an algorithm that determines the proximity of vectors in Euclidean space; this allows us to perform sequence similarity searches in a quicker and more modular fashion.
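Illustration: the abstract's idea of mapping sequences to vectors and searching by Euclidean proximity can be sketched as follows. This is a minimal sketch and not the authors' method: it assumes a simple normalized k-mer frequency embedding in place of the learned embeddings the paper describes, and the names `kmer_embedding` and `nearest` are hypothetical helpers introduced only for this example.

```python
from itertools import product
from math import dist

ALPHABET = "ACGT"


def kmer_embedding(seq, k=3):
    """Map a DNA string to a vector of normalized k-mer frequencies.

    Assumption: a fixed k-mer count embedding stands in for the learned
    embeddings described in the abstract.
    """
    # One vector coordinate per possible k-mer (4**k coordinates).
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0.0] * len(index)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing N or other symbols
            vec[index[kmer]] += 1.0
            total += 1
    return [v / total for v in vec] if total else vec


def nearest(query, library, k=3):
    """Return (name, distance) for the library entry closest to the query.

    Brute-force linear scan over Euclidean distances between embeddings.
    """
    q = kmer_embedding(query, k)
    return min(
        ((name, dist(q, kmer_embedding(seq, k))) for name, seq in library.items()),
        key=lambda item: item[1],
    )


# Toy library of made-up sequences (illustrative data only).
library = {
    "a": "ATGGCCATTGTAATGGGCCGC",
    "b": "TTTTTTTTTTTTTTTTTTTT",
}
name, distance = nearest("ATGGCCATTGTAATGGGCCGA", library)
```

The linear scan is only for clarity. Once sequences are fixed-length vectors, the scan could be swapped for a nearest-neighbor index without changing the embedding step, which is one reading of the "modular" search the abstract mentions.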

Related papers

Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN [0.3015442485490762]
Large-scale similarity search is a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. We explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.
arXiv Detail & Related papers (2025-07-22T19:28:54Z)
Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding [20.815837902767072]
We present an optimal method for encoding cluster assignments of arbitrary data sets. Our method, Random Cycle Coding (RCC), encodes data sequentially and sends assignment information as cycles of the permutation defined by the order of encoded elements.
arXiv Detail & Related papers (2024-11-30T06:24:34Z)
Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval [3.355436702348694]
We create associative knowledge graphs that are highly effective for storing and recognizing sequences. Individual objects (represented as nodes) can be a part of multiple sequences or appear repeatedly within a single sequence. This approach has potential applications in diverse fields, such as anomaly detection in financial transactions or predicting user behavior based on past actions.
arXiv Detail & Related papers (2024-11-19T13:00:31Z)
Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery [6.733319363951907]
Dy-mer is an explainable and robust representation scheme based on sparse recovery. It achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy.
arXiv Detail & Related papers (2024-07-06T15:08:31Z)
Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner. Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval. We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance to a query of a document.
arXiv Detail & Related papers (2024-03-31T13:29:43Z)
Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments [4.34040512215583]
A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences.
arXiv Detail & Related papers (2023-11-29T11:51:05Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Embed-Search-Align: DNA Sequence Alignment using Transformer Models [2.48439258515764]
We bridge the gap by framing the sequence alignment task for Transformer models as an "Embed-Search-Align" task. A novel Reference-Free DNA Embedding model generates embeddings of reads and reference fragments, which are projected into a shared vector space. DNA-ESA is 99% accurate when aligning 250-length reads onto a human genome (3gb), rivaling conventional methods such as Bowtie and BWA-Mem.
arXiv Detail & Related papers (2023-09-20T06:30:39Z)
Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For Hidden Markov Models [70.26374282390401]
Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM-based data analyses. We present Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity.
arXiv Detail & Related papers (2023-05-29T19:37:48Z)
Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences. Exact methods yield better classification performance, but they pose high computational costs. We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics. We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences.
arXiv Detail & Related papers (2020-12-28T17:27:18Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.