Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
- URL: http://arxiv.org/abs/2510.23127v2
- Date: Thu, 30 Oct 2025 12:09:18 GMT
- Title: Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
- Authors: Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan
- Abstract summary: Sci-LLMs have emerged as a promising frontier for accelerating biological discovery. Current strategies limit Sci-LLMs' reasoning capacity when processing raw biomolecular sequences. We show that a more effective strategy is to provide Sci-LLMs with high-level structured context.
- Score: 78.18336140706471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language (risking the loss of functional motif information) or as a separate modality (introducing formidable alignment challenges), current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, the inclusion of the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at https://github.com/opendatalab-raiser/CoKE.
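The comparison at the heart of the abstract is between three input modes: sequence-only, context-only, and both combined. Below is a minimal sketch of what those modes might look like at the prompt level; the function name, field labels, and example annotations are illustrative assumptions, not taken from the paper or the CoKE repository.

```python
# Hypothetical sketch of the three input modes compared in the paper:
# sequence-only, context-only, and sequence+context. All names illustrative.

def build_prompt(mode: str, sequence: str, context: str, question: str) -> str:
    """Assemble a Sci-LLM prompt under one of the three input modes."""
    parts = []
    if mode in ("sequence-only", "both"):
        parts.append(f"Protein sequence:\n{sequence}")
    if mode in ("context-only", "both"):
        # "Context" stands in for structured annotations produced by
        # standard bioinformatics tools (e.g., domain or motif annotations).
        parts.append(f"Structured context:\n{context}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy fragment
ctx = "- Pfam: RNase A family domain\n- GO: endonuclease activity"
for mode in ("sequence-only", "context-only", "both"):
    print(f"=== {mode} ===")
    print(build_prompt(mode, seq, ctx, "What is the likely function of this protein?"))
```

In the paper's experiments, the context-only mode wins and adding the raw sequence hurts; the sketch only makes the input difference between the three conditions concrete.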
Related papers
- Transcending the Annotation Bottleneck: AI-Powered Discovery in Biology and Medicine [0.0]
Self-supervised learning is currently unlocking the latent potential of biobank-scale datasets. This article synthesises seminal and recent advances in "learning without labels" and highlights how unsupervised frameworks can derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance that rivals or exceeds supervised counterparts.
arXiv Detail & Related papers (2026-02-23T18:15:30Z) - DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis [43.565183518761984]
We propose DOGMA, a data-centric framework designed for the structural reshaping and semantic enhancement of raw data. In complex multi-species and multi-organ benchmarks, DOGMA achieves SOTA performance, exhibiting superior zero-shot robustness and sample efficiency.
arXiv Detail & Related papers (2026-02-02T09:10:09Z) - Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning [51.673503054645415]
Biomolecular mechanisms require multi-step reasoning across molecular interactions, signaling cascades, and metabolic pathways. Existing approaches often fall short here: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. We propose a Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains.
arXiv Detail & Related papers (2025-11-11T09:26:32Z) - SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines [112.78540935201558]
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions. It supports a range of capability families covering up to 103 tasks: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, and (v) unconditional and conditional sequence generation and design.
arXiv Detail & Related papers (2025-09-25T17:52:06Z) - Biological Sequence with Language Model Prompting: A Survey [14.270959261105968]
Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. This paper systematically investigates the application of prompt-based methods with LLMs to biological sequences.
arXiv Detail & Related papers (2025-03-06T06:28:36Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark, BEACON (BEnchmArk for COmprehensive RNA tasks and language models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications. Second, we examine a range of models, including traditional approaches like CNNs as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate vital RNA language model components.
arXiv Detail & Related papers (2024-06-14T19:39:19Z) - HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting [33.1455954220194]
HiPrompt is a supervision-efficient knowledge fusion framework.
It elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts.
Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt.
arXiv Detail & Related papers (2023-04-12T16:54:26Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline (a toy sketch of the pairwise-masking objective follows this list).
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
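Since the PMLM entry above describes its objective only at a high level, here is a toy sketch of the pairwise-masking idea under stated assumptions: a tiny transformer encoder (illustrative sizes, random data, a simple concatenation pair head; not the authors' implementation) masks two residue positions jointly and predicts their joint amino-acid distribution, so the loss directly measures how well the model captures inter-residue co-variation.

```python
# Toy sketch of a Pairwise Masked Language Model (PMLM) objective.
# Illustrative assumptions throughout; this is not the authors' implementation.
import torch
import torch.nn as nn

VOCAB = 21   # 20 amino acids + a mask token
MASK = 20    # index of the mask token

class TinyPMLM(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        # Pair head: concatenate the hidden states at the two masked positions
        # and predict their JOINT distribution over 21*21 residue pairs. The
        # joint target is what exposes inter-residue co-variation to the loss.
        self.pair_head = nn.Linear(2 * d_model, VOCAB * VOCAB)

    def forward(self, tokens, i, j):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.embed(tokens) + self.pos(positions))
        batch = torch.arange(tokens.size(0), device=tokens.device)
        return self.pair_head(torch.cat([h[batch, i], h[batch, j]], dim=-1))

# One illustrative training step on random sequences.
model = TinyPMLM()
seqs = torch.randint(0, 20, (8, 64))   # batch of residue ids
i, j = torch.randint(0, 64, (2, 8))    # two positions per sequence (a real
                                       # sampler would force i != j)
targets = seqs[torch.arange(8), i] * VOCAB + seqs[torch.arange(8), j]
masked = seqs.clone()
masked[torch.arange(8), i] = MASK
masked[torch.arange(8), j] = MASK
loss = nn.functional.cross_entropy(model(masked, i, j), targets)
loss.backward()
print(f"toy PMLM loss: {loss.item():.3f}")
```

Predicting the two masked residues jointly, rather than each independently as in standard masked language modeling, is the distinction the PMLM abstract emphasizes and the part of this sketch most faithful to the idea.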
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above (including all listed papers) and is not responsible for any consequences of its use.