Recent advances in the Self-Referencing Embedding Strings (SELFIES)
library
- URL: http://arxiv.org/abs/2302.03620v1
- Date: Tue, 7 Feb 2023 17:24:08 GMT
- Title: Recent advances in the Self-Referencing Embedding Strings (SELFIES)
library
- Authors: Alston Lo, Robert Pollice, AkshatKumar Nigam, Andrew D. White, Mario
Krenn and Alán Aspuru-Guzik
- Abstract summary: String-based molecular representations play a crucial role in cheminformatics applications.
Traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models.
SELF-referencIng Embedded Strings (SELFIES), a representation that is inherently 100% robust, was proposed alongside an accompanying open-source implementation.
- Score: 1.9573380763700712
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: String-based molecular representations play a crucial role in cheminformatics
applications, and with the growing success of deep learning in chemistry, have
been readily adopted into machine learning pipelines. However, traditional
string-based representations such as SMILES are often prone to syntactic and
semantic errors when produced by generative models. To address these problems,
a novel representation, SELF-referencIng Embedded Strings (SELFIES), was
proposed that is inherently 100% robust, alongside an accompanying open-source
implementation. Since then, we have generalized SELFIES to support a wider
range of molecules and semantic constraints and streamlined its underlying
grammar. We have implemented this updated representation in subsequent versions
of the selfies library, where we have also made major advances with respect to design,
efficiency, and supported features. Hence, we present the current status of
the selfies library (version 2.1.1) in this manuscript.
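The following is a minimal, illustrative sketch of the library described above, assuming selfies 2.x is installed via pip; the benzene input and the printed outputs are examples, not taken from the manuscript.

```python
# Minimal sketch of round-tripping a molecule through the selfies library (2.x).
# Assumes `pip install selfies`; benzene is used only as an illustrative input.
import selfies as sf

smiles = "C1=CC=CC=C1"        # benzene, given as a SMILES string
encoded = sf.encoder(smiles)   # SMILES -> SELFIES
decoded = sf.decoder(encoded)  # SELFIES -> SMILES

print(encoded)   # e.g. "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
print(decoded)   # a SMILES string equivalent to the input

# SELFIES strings split cleanly into symbols, which is convenient when
# building vocabularies for generative models.
symbols = list(sf.split_selfies(encoded))
print(len(symbols), symbols)

# The semantic constraints (maximum bonding capacities per atom type) that
# underpin the robustness guarantee are exposed and can be customized.
print(sf.get_semantic_constraints())   # e.g. {'H': 1, 'F': 1, 'C': 4, ...}
```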
Related papers
- Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations [0.0]
We construct Transformer models where the embedding layer is entirely frozen.
Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer.
Despite the absence of trainable, semantically meaningful embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings.
arXiv Detail & Related papers (2025-07-07T11:17:32Z)
- Type-Constrained Code Generation with Language Models [51.03439021895432]
Large language models (LLMs) produce uncompilable output because their next-token inference procedure does not model formal aspects of code.
We introduce a type-constrained decoding approach that leverages type systems to guide code generation.
Our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Efficient Guided Generation for Large Language Models [0.21485350418225244]
We show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine.
This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars.
arXiv Detail & Related papers (2023-07-19T01:14:49Z)
- Improving Zero-Shot Generalization for CLIP with Synthesized Prompts [135.4317555866831]
Most existing methods require labeled data for all classes, which may not hold in real-world applications.
We propose a plug-and-play generative approach called SyntHesIzed Prompts (SHIP) to improve existing fine-tuning methods.
arXiv Detail & Related papers (2023-07-14T15:15:45Z)
- Automatic Context Pattern Generation for Entity Set Expansion [40.535332689515656]
We develop a module that automatically generates high-quality context patterns for entities.
We also propose the GAPA framework that leverages the aforementioned GenerAted PAtterns to expand target entities.
arXiv Detail & Related papers (2022-07-17T06:50:35Z)
- On Adversarial Robustness of Synthetic Code Generation [1.2559148369195197]
This paper showcases the existence of significant dataset bias through different classes of adversarial examples.
We propose several dataset augmentation techniques to reduce bias and showcase their efficacy.
arXiv Detail & Related papers (2021-06-22T09:37:48Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.