From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization
- URL: http://arxiv.org/abs/2511.10056v1
- Date: Fri, 14 Nov 2025 01:29:22 GMT
- Title: From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization
- Authors: Zijing Liu, Bin Feng, He Cao, Yu Li
- Abstract summary: Protein structure tokenization converts 3D structures into discrete or vectorized representations. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. We show that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings.
- Score: 15.864659611818661
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings to bridge the semantic gap between the sequence and structural "language". The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as "structural synonyms". This redundancy, rather than being a flaw, can be exploited with a simple "synonym swap" strategy to generate diverse conformational ensembles by perturbing a predicted structure with its structural synonyms. This computationally lightweight method accurately recapitulates protein flexibility, performing competitively with state-of-the-art models. Our study provides fundamental insights into the nature of discrete protein structure representations and introduces a powerful, near-instantaneous method for modeling protein dynamics. Source code is available at https://github.com/IDEA-XL/TokenMD.
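The abstract describes the synonym-swap strategy only at a high level, so the following is a minimal NumPy sketch of one plausible reading: treat two codebook entries as synonyms when their code vectors are nearly identical, then resample a fraction of a predicted token sequence from each token's synonym set. The names `codebook`, `top_k`, and `swap_rate` are hypothetical; decoding the perturbed sequences back to 3D conformations requires the tokenizer's decoder and is not shown.

```python
import numpy as np

def build_synonym_table(codebook: np.ndarray, top_k: int = 5) -> list:
    """For each code vector, collect the top_k nearest other codes ("structural synonyms")."""
    sq = (codebook ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances between all code vectors (K x K).
    dists = sq[:, None] + sq[None, :] - 2.0 * codebook @ codebook.T
    np.fill_diagonal(dists, np.inf)  # a token is not its own synonym
    return [np.argsort(row)[:top_k] for row in dists]

def synonym_swap(tokens: np.ndarray, synonyms: list, swap_rate: float = 0.1,
                 rng=None) -> np.ndarray:
    """Replace a random fraction of token positions with one of their synonyms."""
    rng = rng or np.random.default_rng()
    perturbed = tokens.copy()
    for i in np.flatnonzero(rng.random(tokens.shape[0]) < swap_rate):
        perturbed[i] = rng.choice(synonyms[tokens[i]])
    return perturbed

# Each perturbed sequence, once decoded, yields one ensemble member:
# table = build_synonym_table(codebook)           # codebook: (K, D) from a tokenizer
# ensemble = [synonym_swap(pred_tokens, table) for _ in range(100)]
```

Because no sampling model is run at generation time, the whole procedure amounts to table lookups and random swaps, which is consistent with the abstract's "computationally lightweight, near-instantaneous" claim.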
Related papers
- Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification [7.405170407676887]
We introduce a topic-modeling framework that operationalizes semantic latent structure for scale simplification. Items are encoded using contextual sentence embeddings and grouped via density-based clustering. We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency.
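The summary names a concrete pipeline (contextual sentence embeddings grouped by density-based clustering); below is a minimal sketch under common-library assumptions. The model name, the `eps` value, and the example items are illustrative, not the authors' choices.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

# Illustrative scale items (the paper benchmarks DASS, IPIP, and EPOCH).
items = [
    "I found it hard to wind down.",
    "I found it difficult to relax.",
    "I felt that life was meaningless.",
]

# Encode each item with a contextual sentence embedder.
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
embeddings = model.encode(items, normalize_embeddings=True)

# Group items by density-based clustering on cosine distance; eps is data-dependent.
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(embeddings)
print(labels)  # items sharing a label form one semantic cluster; -1 marks noise
```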
arXiv Detail & Related papers (2026-02-13T03:37:15Z)
- A Study of Adaptive Modeling Towards Robust Generalization [14.00955228748485]
We present a unified all-atom framework that grounds language reasoning in geometric information while adaptively scaling structural tokens. Across diverse all-atom benchmarks, the proposed approach yields consistent gains in heterogeneous structure-grounded reasoning.
arXiv Detail & Related papers (2026-02-02T20:35:44Z)
- Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models [8.87728727154868]
We propose a set of probability signatures that reflect the semantic relationships among tokens. We generalize our work to large language models (LLMs) by training the Qwen2.5 architecture on subsets of the Pile corpus.
arXiv Detail & Related papers (2025-09-24T13:49:44Z)
- StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching [10.000850856259866]
StructCoh is a graph-enhanced contrastive learning framework. A hierarchical contrastive objective enforces consistency at multiple granularities. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements.
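The "hierarchical contrastive objective" is not spelled out in the blurb; a minimal PyTorch sketch of a standard InfoNCE loss applied at two granularities, which is one assumption about what "multiple granularities" could mean:

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: a[i] pairs with b[i]; every other b[j] is a negative."""
    logits = F.normalize(a, dim=1) @ F.normalize(b, dim=1).T / tau
    return F.cross_entropy(logits, torch.arange(a.size(0), device=a.device))

# "Consistency at multiple granularities" read here as summing the loss at the
# node level and the graph level (hypothetical tensors from a graph encoder):
# loss = info_nce(node_emb_a, node_emb_b) + info_nce(graph_emb_a, graph_emb_b)
```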
arXiv Detail & Related papers (2025-09-02T07:21:36Z)
- FoldToken2: Learning compact, invariant and generative protein structure language [48.1647245005672]
We propose FoldToken2 to transform equivariant structures into discrete tokens, while maintaining the recoverability of the original structures.
We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the earlier FoldToken1 by 20% in TMScore and 81% in RMSD.
We believe that FoldToken2 will inspire further improvement in protein structure representation learning, structure alignment, and structure generation tasks.
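FoldToken-style tokenization rests on vector quantization: encoder latents are snapped to their nearest codebook entry, and the resulting indices are the discrete structure tokens. Below is a generic nearest-codebook lookup with a straight-through estimator in PyTorch, a textbook VQ layer rather than FoldToken2's actual architecture; the encoder and decoder names are hypothetical.

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Map encoder latents z (N, D) to their nearest codes in codebook (K, D)."""
    dists = torch.cdist(z, codebook)   # (N, K) Euclidean distances
    tokens = dists.argmin(dim=1)       # discrete structure tokens, shape (N,)
    quantized = codebook[tokens]       # (N, D) code vectors
    # Straight-through estimator: the decoder sees the codes, but gradients
    # flow back to z as if quantization were the identity.
    quantized = z + (quantized - z).detach()
    return tokens, quantized

# tokens, z_q = vector_quantize(structure_encoder(coords), codebook)  # hypothetical encoder
# recon = structure_decoder(z_q)  # reconstruction quality is what TMScore/RMSD measure
```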
arXiv Detail & Related papers (2024-06-11T09:24:51Z)
- Large Language Model-driven Meta-structure Discovery in Heterogeneous Information Network [29.149367323751413]
We propose ReStruct, a meta-structure search framework that integrates LLM reasoning into the evolutionary procedure.
We show that ReStruct achieves state-of-the-art performance in both recommendation and node classification tasks.
arXiv Detail & Related papers (2024-02-18T09:21:12Z)
- FoldToken: Learning Protein Language via Vector Quantization and Beyond [56.19308144551836]
We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols.
We refer to the learned symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language.
arXiv Detail & Related papers (2024-02-04T12:18:51Z)
- StructRe: Rewriting for Structured Shape Modeling [60.20359722058389]
We present StructRe, a structure rewriting system, as a novel approach to structured shape modeling. Given a 3D object represented by points and components, StructRe can rewrite it upward into more concise structures, or downward into more detailed structures.
arXiv Detail & Related papers (2023-11-29T10:35:00Z)
- StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure [5.2869308707704255]
StrAE is a Structured Autoencoder framework that, through strict adherence to explicit structure, enables effective learning of multi-level representations.
We show that our results are directly attributable to the informativeness of the structure provided as input, and that this does not hold for existing tree models.
We then extend StrAE to allow the model to define its own compositions using a simple localised-merge algorithm.
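The "simple localised-merge algorithm" is named but not described; one plausible greedy reading, sketched in NumPy, repeatedly merges the most similar adjacent pair of embeddings until a single root remains. Cosine similarity and mean-pool merging are assumptions, not the paper's definition.

```python
import numpy as np

def localized_merge(vectors: list) -> np.ndarray:
    """Greedily merge the most cosine-similar adjacent pair until one root vector remains."""
    vecs = [v / np.linalg.norm(v) for v in vectors]
    while len(vecs) > 1:
        sims = [float(vecs[i] @ vecs[i + 1]) for i in range(len(vecs) - 1)]
        i = int(np.argmax(sims))              # most similar adjacent pair
        merged = (vecs[i] + vecs[i + 1]) / 2  # mean-pool composition (assumption)
        vecs[i:i + 2] = [merged / np.linalg.norm(merged)]
    return vecs[0]
```

The order in which pairs are merged implicitly defines a binary tree over the sequence, which is one way the model could "define its own compositions" without structure given as input.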
arXiv Detail & Related papers (2023-05-09T16:20:48Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves new state-of-the-art results on all the structured prediction tasks we evaluated.
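The key idea (linearize a structure into a sequence of actions that a PLM can decode autoregressively) can be shown on a toy span-labeling example; the bracket-style action inventory below is a hypothetical choice, not the paper's exact scheme.

```python
def spans_to_actions(tokens, spans):
    """Linearize labeled spans (start, end, label), end-exclusive, into actions."""
    actions = []
    for i, tok in enumerate(tokens):
        actions += [f"[{lab}" for s, _, lab in spans if s == i]      # open spans here
        actions.append(tok)                                          # copy the token
        actions += [f"{lab}]" for _, e, lab in spans if e == i + 1]  # close spans here
    return actions

print(spans_to_actions(["Barack", "Obama", "visited", "Paris"],
                       [(0, 2, "PER"), (3, 4, "LOC")]))
# ['[PER', 'Barack', 'Obama', 'PER]', 'visited', '[LOC', 'Paris', 'LOC]']
```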
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)