Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
- URL: http://arxiv.org/abs/2412.14642v3
- Date: Mon, 15 Sep 2025 17:29:42 GMT
- Title: Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
- Authors: Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Dongzhan Zhou, Xiao-yong Wei, Qing Li
- Abstract summary: We propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate Large Language Models (LLMs) in open-domain natural language-driven molecule generation. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench.
- Score: 26.166926881479316
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, Large Language Models (LLMs) have shown great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on a one-to-one mapping, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to demonstrate genuine molecular understanding and generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 28 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery.
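To make the one-to-many evaluation concrete, here is a minimal sketch (not the official S^2-Bench code) of how a MolEdit-style instruction such as "add a hydroxyl group" might be checked: any valid candidate that gains the requested group passes, rather than a single reference answer. RDKit and the helper name are assumptions.

```python
# Hedged sketch: one-to-many checking for a MolEdit-style instruction.
# Assumes RDKit (pip install rdkit); the function name is illustrative,
# not the official S^2-Bench evaluator.
from rdkit import Chem

def passes_edit(source_smiles: str, candidate_smiles: str, group_smarts: str) -> bool:
    """Accept any valid candidate that adds the requested group."""
    src = Chem.MolFromSmiles(source_smiles)
    cand = Chem.MolFromSmiles(candidate_smiles)
    if src is None or cand is None:
        return False  # invalid SMILES fails immediately
    group = Chem.MolFromSmarts(group_smarts)
    # The candidate must contain strictly more of the target group than the source.
    return len(cand.GetSubstructMatches(group)) > len(src.GetSubstructMatches(group))

# Both candidates add a hydroxyl to benzene, so both count as correct:
print(passes_edit("c1ccccc1", "Oc1ccccc1", "[OX2H]"))   # True (phenol)
print(passes_edit("c1ccccc1", "OCc1ccccc1", "[OX2H]"))  # True (benzyl alcohol)
```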
Related papers
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
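As a hedged illustration of the three reported metrics, the sketch below computes SMILES validity, formula consistency, and fingerprint-based structural similarity with RDKit; it is not the paper's evaluation code, and it assumes the reference SMILES is valid.

```python
# Hedged sketch of the three metrics; assumes RDKit and a valid
# reference SMILES. Not the paper's evaluation code.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def score(pred_smiles: str, ref_smiles: str) -> dict:
    pred = Chem.MolFromSmiles(pred_smiles)
    ref = Chem.MolFromSmiles(ref_smiles)
    if pred is None:  # SMILES validity gate
        return {"valid": False, "formula_match": False, "tanimoto": 0.0}
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, 2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, 2, 2048)
    return {
        "valid": True,
        "formula_match": CalcMolFormula(pred) == CalcMolFormula(ref),  # formula consistency
        "tanimoto": DataStructs.TanimotoSimilarity(fp_pred, fp_ref),   # structural similarity
    }

print(score("CCO", "CCO"))  # perfect prediction: formula match, similarity 1.0
```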
arXiv Detail & Related papers (2026-01-09T20:08:42Z)
- MolEdit: Knowledge Editing for Multimodal Molecule Language Models [57.85765246726558]
MolEdit is a framework for molecule-to-caption generation and caption-to-molecule generation. MolEdit combines a Multi-Expert Knowledge Adapter that routes edits to specialized experts for different molecular facets with an Expertise-Aware Editing Switcher. MolEdit delivers up to 18.8% higher Reliability and 12.0% better Locality than baselines while maintaining efficiency.
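The routing idea can be pictured with a toy sketch: each edit request is dispatched to a per-facet expert, and a gate decides whether any expert should fire at all. The keyword-based router and all names below are illustrative assumptions, not the MolEdit architecture.

```python
# Toy sketch of facet-based edit routing; the keyword router and names
# are assumptions for illustration, not the MolEdit implementation.
from typing import Callable, Dict

EXPERTS: Dict[str, Callable[[str], str]] = {
    "property": lambda req: f"[property expert] applied: {req}",
    "structure": lambda req: f"[structure expert] applied: {req}",
}

def switcher(request: str) -> bool:
    """Expertise-aware gate: only edit when some expert is competent."""
    return any(facet in request for facet in EXPERTS)

def route_edit(request: str) -> str:
    if not switcher(request):
        return "no edit: out-of-expertise request left untouched"
    facet = next(f for f in EXPERTS if f in request)
    return EXPERTS[facet](request)

print(route_edit("update the property: boiling point of ethanol"))
print(route_edit("fix the caption wording"))  # gated out by the switcher
```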
arXiv Detail & Related papers (2025-11-16T20:48:37Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- Large Language Model Agent for Modular Task Execution in Drug Discovery [7.1616715247845955]
We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation.
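A minimal sketch of such modular dispatch is shown below; the tool names and stub implementations are assumptions for illustration, not the paper's framework.

```python
# Minimal sketch of a modular tool-dispatch loop; the tool set and the
# stub bodies are illustrative assumptions, not the paper's framework.
from typing import Callable, Dict

def predict_logp(smiles: str) -> str:
    # Stand-in for a property-prediction module (e.g., an RDKit descriptor).
    return f"logP prediction requested for {smiles}"

def generate_molecule(spec: str) -> str:
    # Stand-in for an LLM-backed molecule generator.
    return f"molecule generated for spec: {spec}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "property_prediction": predict_logp,
    "molecule_generation": generate_molecule,
}

def run_task(tool_name: str, payload: str) -> str:
    """Dispatch one pipeline step to its module, as an agent planner might."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](payload)

print(run_task("molecule_generation", "soluble kinase-inhibitor-like scaffold"))
print(run_task("property_prediction", "CCO"))
```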
arXiv Detail & Related papers (2025-06-26T00:19:01Z)
- ChemMLLM: Chemical Multimodal Large Language Model [52.95382215206681]
We propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks across text, molecular SMILES strings, and images, and curate the corresponding datasets. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks.
arXiv Detail & Related papers (2025-05-22T07:32:17Z)
- A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization [20.160910256604726]
Large language models (LLMs) are introducing a paradigm shift in molecular discovery. This survey provides an up-to-date review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization.
arXiv Detail & Related papers (2025-05-22T00:26:27Z)
- OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution [4.742123770879715]
Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications.
We propose OpenTuringBench, a new benchmark based on OLLMs to train and evaluate machine-generated text detectors.
arXiv Detail & Related papers (2025-04-15T16:36:14Z)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples.
Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments.
We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
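A hedged sketch of what one such sample might look like, based only on the fields listed above; the actual OpenCodeInstruct field names may differ.

```python
# Hedged sketch of one sample's shape, derived from the fields the
# abstract lists; the exact OpenCodeInstruct schema may differ.
from dataclasses import dataclass
from typing import List

@dataclass
class CodeInstructSample:
    question: str                # programming question
    solution: str                # reference solution
    test_cases: List[str]        # executable checks
    execution_feedback: str      # output of running the tests
    quality_assessment: str      # LLM-generated quality judgment

sample = CodeInstructSample(
    question="Write a function that reverses a string.",
    solution="def reverse(s: str) -> str:\n    return s[::-1]",
    test_cases=["assert reverse('ab') == 'ba'"],
    execution_feedback="all tests passed",
    quality_assessment="clear, idiomatic, O(n)",
)
print(sample.question)
```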
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
- Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models [43.37148291436855]
We present a two-step framework PEIT to improve large language models for molecular-related tasks.
In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN.
In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks.
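As a rough illustration of how a multi-constraint instruction sample could be synthesized from computed properties, here is a sketch assuming RDKit; the phrasing and fields are illustrative, not the PEIT data format.

```python
# Hedged sketch: synthesize a multi-constraint instruction sample from
# computed properties. Assumes RDKit; the template and field names are
# illustrative, not the PEIT data format.
from rdkit import Chem
from rdkit.Chem import Descriptors

def make_sample(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    mw = Descriptors.MolWt(mol)      # molecular weight constraint
    logp = Descriptors.MolLogP(mol)  # lipophilicity constraint
    instruction = (
        f"Generate a molecule with molecular weight near {mw:.0f} "
        f"and logP near {logp:.1f}."
    )
    return {"instruction": instruction, "output": smiles}

print(make_sample("CCO"))
```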
arXiv Detail & Related papers (2024-12-24T01:48:07Z)
- MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction [44.27112553103388]
We present Molecule Caption Arena: the first comprehensive benchmark of large language model (LLM)-augmented molecular property prediction.
We evaluate over twenty LLMs, including both general-purpose and domain-specific molecule captioners, across diverse prediction tasks.
Our findings confirm the ability of LLM-extracted knowledge to enhance state-of-the-art molecular representations.
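A minimal sketch of the caption-augmentation idea follows, fusing a structural fingerprint with a toy, hash-seeded caption embedding; RDKit and NumPy are assumed, and a real system would use a learned text encoder.

```python
# Toy sketch: fuse an LLM caption with a structural fingerprint.
# Assumes RDKit and NumPy; the hash-seeded text embedding is a stand-in
# for a real language-model encoder. Not the benchmark's code.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def caption_embedding(caption: str, dim: int = 64) -> np.ndarray:
    """Toy text embedding seeded from the caption's hash."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.standard_normal(dim)

def fused_features(smiles: str, caption: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 256)
    arr = np.zeros((256,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)                   # structure view
    return np.concatenate([arr, caption_embedding(caption)])   # + text view

feats = fused_features("CCO", "a small, water-miscible primary alcohol")
print(feats.shape)  # (320,): 256 fingerprint bits + 64 caption dims
```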
arXiv Detail & Related papers (2024-11-01T17:03:16Z)
- Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL).
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
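In the spirit of that idea, the sketch below assembles a many-shot prompt that mixes scarce experimental labels with surrogate-model pseudo-labels; the prompt format is an assumption, not the paper's.

```python
# Hedged sketch: a many-shot ICL prompt that mixes scarce experimental
# data with model-labeled (pseudo-labeled) examples, in the spirit of
# the semi-supervised idea; the format is an assumption.
from typing import List, Tuple

def build_prompt(labeled: List[Tuple[str, float]],
                 pseudo_labeled: List[Tuple[str, float]],
                 target: float) -> str:
    lines = ["Each line pairs a SMILES string with a measured property."]
    for smiles, value in labeled + pseudo_labeled:  # pseudo-labels extend the shot pool
        lines.append(f"SMILES: {smiles}  property: {value:.2f}")
    lines.append(f"Propose a new SMILES with property close to {target:.2f}.")
    return "\n".join(lines)

experimental = [("CCO", 0.81), ("CCCO", 1.10)]
pseudo = [("CCCCO", 1.45)]  # labeled by a surrogate model, not an experiment
print(build_prompt(experimental, pseudo, target=1.30))
```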
arXiv Detail & Related papers (2024-07-26T21:10:50Z)
- MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension [34.586861881519134]
Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields.
This study seeks to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, namely MolX.
In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations.
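A minimal sketch of the two input views follows, using RDKit and trivial stand-in "encoders" (MolX uses learned ones).

```python
# Hedged sketch of the two input views: token-level features from the
# SMILES string and a 2D graph from the same molecule. The "encoders"
# are trivial stand-ins, not MolX's learned modules. Assumes RDKit.
from rdkit import Chem

def smiles_view(smiles: str) -> list:
    """Character-level view of the SMILES string (stand-in encoder)."""
    return [ord(ch) for ch in smiles]

def graph_view(smiles: str) -> tuple:
    """2D molecular graph: atom symbols plus bond (edge) list."""
    mol = Chem.MolFromSmiles(smiles)
    atoms = [a.GetSymbol() for a in mol.GetAtoms()]
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    return atoms, edges

print(smiles_view("CCO"))  # [67, 67, 79]
print(graph_view("CCO"))   # (['C', 'C', 'O'], [(0, 1), (1, 2)])
```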
arXiv Detail & Related papers (2024-06-10T20:25:18Z)
- RAG-Enhanced Commit Message Generation [8.858678357308726]
Because writing commit messages manually is time-consuming, commit message generation has become a research hotspot.
This paper proposes REACT, a REtrieval-Augmented framework for CommiT message generation.
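A hedged sketch of the retrieval-augmented idea: pick the most similar past diff by token overlap and show its message as an exemplar. The retriever and prompt format are illustrative, not REACT's implementation.

```python
# Hedged sketch: retrieve the most similar past diff and present its
# commit message as an exemplar. The Jaccard retriever and prompt
# format are illustrative assumptions, not REACT's implementation.
from typing import List, Tuple

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_commit_prompt(new_diff: str, corpus: List[Tuple[str, str]]) -> str:
    exemplar_diff, exemplar_msg = max(corpus, key=lambda p: jaccard(new_diff, p[0]))
    return (f"Similar diff:\n{exemplar_diff}\nIts message: {exemplar_msg}\n\n"
            f"New diff:\n{new_diff}\nWrite a commit message:")

corpus = [("+ add retry to http client", "feat: retry failed HTTP requests"),
          ("- remove unused import os", "chore: drop unused import")]
print(build_commit_prompt("+ add retry to rpc client", corpus))
```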
arXiv Detail & Related papers (2024-06-08T16:24:24Z)
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce TSMMG, a multi-constraint molecular generation large language model that acts as a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from a collection of 'teacher' models and tools.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex, natural language-described property requirements.
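As a rough sketch of pair construction, RDKit can play the "teacher" by supplying facts that are templated into a description; the template and field names below are assumptions.

```python
# Hedged sketch: build a text-molecule training pair by querying tool
# "teachers" for facts; RDKit plays the teacher here, and the sentence
# template is an illustrative assumption.
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.rdMolDescriptors import CalcNumRings

def teacher_pair(smiles: str) -> dict:
    mol = Chem.MolFromSmiles(smiles)
    desc = (f"A molecule with molecular weight {Descriptors.MolWt(mol):.0f}, "
            f"logP {Descriptors.MolLogP(mol):.1f}, and {CalcNumRings(mol)} ring(s).")
    return {"text": desc, "molecule": smiles}

print(teacher_pair("c1ccccc1O"))  # phenol: MW ~94, one ring
```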
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
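A minimal sketch of a format-compliance check in this spirit: verify that the output parses and carries the required domain-specific fields. The schema is an invented example, not a FoFo test case.

```python
# Hedged sketch of a format-compliance check: the output must parse as
# JSON and contain the required fields. The schema is an invented
# example, not a FoFo test case.
import json

REQUIRED_KEYS = {"patient_id", "diagnosis", "icd10_code"}

def follows_format(output: str) -> bool:
    try:
        record = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

good = '{"patient_id": "P1", "diagnosis": "flu", "icd10_code": "J11.1"}'
bad = '{"patient_id": "P1"}'
print(follows_format(good), follows_format(bad))  # True False
```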
arXiv Detail & Related papers (2024-02-28T19:23:27Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation [7.979116939578324]
Large language models (LLMs) are a powerful tool for a wide span of applications involving natural language.
We explore the capabilities of state-of-the-art LLMs, including the open-source LLMs Meta Codellama, Phind's fine-tuned version of Codellama, and Deepseek Coder, as well as the closed-source LLMs OpenAI GPT-3.5-Turbo and GPT-4-Turbo.
arXiv Detail & Related papers (2023-10-08T01:43:39Z)
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark that evaluates prominent Large Language Models (LLMs) on generating complex structured data.
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
- Can Large Language Models Empower Molecular Property Prediction? [16.5246941211725]
Molecular property prediction has gained significant attention due to its transformative potential in scientific disciplines.
Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP.
In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules.
arXiv Detail & Related papers (2023-07-14T16:06:42Z)
- Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of the resulting method, MolReGPT, on molecule-caption translation, including molecule understanding and text-based molecule generation.
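A hedged sketch of similarity-based example retrieval for few-shot prompting, assuming RDKit; the prompt template is illustrative rather than MolReGPT's exact format.

```python
# Hedged sketch: retrieve the k most similar molecules by Tanimoto
# similarity and format them as few-shot examples. Assumes RDKit; the
# prompt template is an assumption, not MolReGPT's exact format.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles: str):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, 1024)

def few_shot_prompt(query: str, corpus: list, k: int = 2) -> str:
    qfp = fingerprint(query)
    ranked = sorted(corpus,
                    key=lambda p: DataStructs.TanimotoSimilarity(qfp, fingerprint(p[0])),
                    reverse=True)
    shots = [f"Molecule: {s}\nCaption: {c}" for s, c in ranked[:k]]
    return "\n\n".join(shots + [f"Molecule: {query}\nCaption:"])

corpus = [("CCO", "ethanol, a simple alcohol"),
          ("CC(=O)O", "acetic acid, a carboxylic acid"),
          ("c1ccccc1", "benzene, an aromatic hydrocarbon")]
print(few_shot_prompt("CCCO", corpus))
```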
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.