From Sentences to Sequences: Rethinking Languages in Biological System
- URL: http://arxiv.org/abs/2507.00953v2
- Date: Thu, 03 Jul 2025 10:33:16 GMT
- Title: From Sentences to Sequences: Rethinking Languages in Biological System
- Authors: Ke Liu, Shuaike Shen, Hao Chen, et al.
- Abstract summary: We revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence, we highlight the importance of structural evaluation.
- Score: 6.304152224988003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at https://github.com/zjuKeLiu/RiFold
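To make the auto-regressive paradigm concrete, here is a minimal sketch of next-residue sampling over a protein alphabet. The uniform scoring function is a placeholder standing in for a trained language model; this is not the paper's RiFold implementation.

```python
# Minimal sketch: auto-regressive generation over a protein "language".
# Each residue is sampled conditioned on the prefix generated so far.
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def next_token_distribution(prefix: str) -> dict:
    """Placeholder for a trained model's p(next residue | prefix)."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate_sequence(length: int, seed: int = 0) -> str:
    """Sample a sequence left to right, one residue at a time."""
    random.seed(seed)
    sequence = ""
    for _ in range(length):
        dist = next_token_distribution(sequence)
        residues, weights = zip(*dist.items())
        sequence += random.choices(residues, weights=weights, k=1)[0]
    return sequence

print(generate_sequence(30))  # e.g. a 30-residue sample
```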
Related papers
- BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects [14.172782866715844]
Large language models (LLMs) trained on text have demonstrated remarkable results on natural language processing (NLP) tasks. DNA differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. We pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs). Our findings indicate that integrating sequence variations into DNA language models (DNALMs) helps capture biological function, as seen in improvements on all fine-tuning tasks.
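As a rough illustration of what "integrating sequence variations" can mean at the data level, the sketch below injects a VCF-style SNP into a reference sequence; it is not BMFM-DNA's actual preprocessing.

```python
# Illustrative sketch: apply a known SNP to a reference DNA sequence so a
# model sees the variant allele. Positions are 0-based; ref/alt follow
# VCF-style conventions.
def apply_snp(reference: str, position: int, ref: str, alt: str) -> str:
    assert reference[position] == ref, "reference allele mismatch"
    return reference[:position] + alt + reference[position + 1:]

sequence = "ACGTACGTAC"
variant = apply_snp(sequence, position=4, ref="A", alt="G")
print(variant)  # ACGTGCGTAC
```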
arXiv Detail & Related papers (2025-06-26T13:56:32Z)
- Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification [55.98854157265578]
Life-Code is a comprehensive framework that spans different biological functions. We propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
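A minimal sketch of the two mappings named above, assuming a simple coding-strand convention. Reverse translation is degenerate (many codons per amino acid), so one representative codon is chosen per residue here; the actual Life-Code pipeline may resolve this differently.

```python
# One valid codon per amino acid (degeneracy deliberately ignored).
REPRESENTATIVE_CODON = {
    "M": "ATG", "W": "TGG", "F": "TTT", "L": "CTG", "S": "TCT",
    "Y": "TAT", "C": "TGT", "P": "CCT", "H": "CAT", "Q": "CAG",
    "R": "CGT", "I": "ATT", "T": "ACT", "N": "AAT", "K": "AAA",
    "V": "GTG", "A": "GCT", "D": "GAT", "E": "GAA", "G": "GGT",
}

def reverse_transcribe(rna: str) -> str:
    """Map RNA onto the DNA alphabet (U -> T), recovering the coding strand."""
    return rna.replace("U", "T")

def reverse_translate(protein: str) -> str:
    """Map each residue to one representative codon."""
    return "".join(REPRESENTATIVE_CODON[aa] for aa in protein)

print(reverse_transcribe("AUGGCU"))  # ATGGCT
print(reverse_translate("MA"))       # ATGGCT
```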
arXiv Detail & Related papers (2025-02-11T06:53:59Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
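A hedged sketch of prompt-responsive generation with a causal genomic LM through the Hugging Face transformers API. The checkpoint name below is a placeholder assumption, not a confirmed GENERator release identifier; substitute the actual published checkpoint.

```python
# Sketch: extend a DNA prefix with a causal genomic language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "GENERator-placeholder"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"  # DNA prefix to extend
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```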
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequence-related instruction-tuning dataset. This dataset can bridge the gap between large language models (LLMs) and complex biological sequence-related tasks. We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
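For intuition, a hypothetical instruction-tuning record for a DNA task is shown below; the field names are illustrative, not the published Biology-Instructions schema.

```python
# Hypothetical shape of one multi-omics instruction-tuning example.
record = {
    "instruction": "Predict whether the following DNA sequence contains "
                   "a promoter region. Answer yes or no.",
    "input": "TATAAAAGGCGCGTACGATCGATCGGCTAGCTAGCATCG",
    "output": "yes",
    "omics": "DNA",  # could also be "RNA" or "protein"
}
```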
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- Can linguists better understand DNA? [0.0]
This study investigates whether capabilities transfer between natural language and gene sequences/languages. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences.
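A minimal sketch of how gene-coding-determination labels could be constructed, using Biopython's standard translation table; the paper's actual dataset construction may differ.

```python
# Sketch: label a (DNA, protein) pair positive when the DNA translates
# to the given protein under the standard codon table.
from Bio.Seq import Seq

def codes_for(dna: str, protein: str) -> bool:
    return str(Seq(dna).translate()) == protein

print(codes_for("ATGGCTGAA", "MAE"))  # True
print(codes_for("ATGGCTGAA", "MAK"))  # False
```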
arXiv Detail & Related papers (2024-12-10T17:06:33Z)
- Morphological Typology in BPE Subword Productivity and Language Modeling [0.0]
We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized.
Experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization.
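To ground "subword productivity", the sketch below implements a single BPE training step: count adjacent symbol pairs in a toy corpus and merge the most frequent one into a new subword.

```python
# Minimal sketch of one BPE merge step on a toy corpus.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the commonest."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = "".join(pair)
    out = []
    for symbols in words:
        i, new = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(symbols[i])
                i += 1
        out.append(new)
    return out

corpus = [list(w) for w in ["lower", "lowest", "newer", "newest"]]
pair = most_frequent_pair(corpus)  # ('w', 'e'), seen 4 times
print(pair, merge(corpus, pair))
```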
arXiv Detail & Related papers (2024-10-31T06:13:29Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
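A minimal NumPy sketch of the vector-quantization step: each position's embedding is mapped to the index of its nearest codebook vector, yielding discrete token ids. Codebook learning itself (e.g., a VQ-VAE commitment loss) is omitted.

```python
# Sketch: assign each genome-position embedding to its nearest codebook
# vector, turning continuous features into a discrete vocabulary.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))    # 512 learnable code vectors
embeddings = rng.normal(size=(100, 64))  # per-position genome features

# Pairwise distances (100, 512), then nearest codebook entry per position.
distances = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
token_ids = distances.argmin(axis=1)
print(token_ids[:10])
```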
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- ImmunoLingo: Linguistics-based formalization of the antibody language [0.5412332666265471]
Apparent parallels between natural language and biological sequences have led to a surge in the application of deep language models (LMs) in biology. The lack of a rigorous linguistic formalization of biological sequence languages has led to largely domain-unspecific applications of LMs. A linguistic formalization establishes linguistically informed, and thus domain-adapted, components for LM applications.
arXiv Detail & Related papers (2022-09-26T12:33:14Z)
- Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models [27.91397366776451]
Training LSTMs on latent structure (MIDI music or Java code) improves test performance on natural language.
Experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological similarity to the training language.
arXiv Detail & Related papers (2020-04-30T06:24:03Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.