Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies
- URL: http://arxiv.org/abs/2512.15312v1
- Date: Wed, 17 Dec 2025 11:02:31 GMT
- Title: Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies
- Authors: Charan Prakash Rathore, Saumi Ray, Dhruv Kumar
- Abstract summary: This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying Large Language Models to scientific information extraction? We focus on four key subtasks: event type classification, trigger text identification, argument role extraction, and argument text extraction. We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
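The abstract above describes zero-shot prompting for four extraction subtasks. As a rough illustration only, the sketch below shows what a zero-shot prompt and response parse for one sentence might look like; the paper's actual prompts and the ZSEE event/role schema are not reproduced here, so the event type and argument role names are invented for this example, and a mocked JSON reply stands in for a real LLM call.

```python
import json

def build_zero_shot_prompt(sentence: str) -> str:
    """Assemble a hypothetical zero-shot prompt covering all four subtasks."""
    return (
        "Extract synthesis events from the sentence below. "
        "Return JSON with keys: event_type, trigger_text, "
        "arguments (a list of {role, text} objects).\n"
        f"Sentence: {sentence}"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply; a real pipeline would add error handling."""
    return json.loads(raw)

sentence = "The gel was heated at 170 C for 48 h under autogenous pressure."
prompt = build_zero_shot_prompt(sentence)

# Mocked model reply, standing in for an actual LLM API call:
mock_reply = json.dumps({
    "event_type": "Crystallization",
    "trigger_text": "heated",
    "arguments": [
        {"role": "temperature", "text": "170 C"},
        {"role": "duration", "text": "48 h"},
    ],
})
event = parse_response(mock_reply)
print(event["event_type"], event["trigger_text"])
```

Few-shot, event-specific, and reflection-based variants would extend the same prompt template with worked examples, per-event instructions, or a second self-review pass, respectively.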
Related papers
- An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs [3.925313161884993]
Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP). This study investigates whether low-parameter Large Language Models (4B parameters) can achieve comparable results through fine-tuning strategies. Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings.
arXiv Detail & Related papers (2026-03-05T17:27:42Z) - Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification [1.2234742322758418]
This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews. We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types. CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models.
arXiv Detail & Related papers (2025-10-17T16:53:09Z) - LLM, Reporting In! Medical Information Extraction Across Prompting, Fine-tuning and Post-correction [6.180091953616749]
This work presents our participation in the EvalLLM 2025 challenge on biomedical Named Entity Recognition (NER) and health event extraction in French. For NER, we propose three approaches combining large language models (LLMs), annotation guidelines, synthetic data, and post-processing. Results show GPT-4.1 leads with a macro-F1 of 61.53% for NER and 15.02% for event extraction.
arXiv Detail & Related papers (2025-10-03T23:59:40Z) - What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction [0.3441021278275805]
This study evaluates the practical performance of three LLMs across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics. We tested four distinct prompting strategies to determine how to improve extraction quality. Customised prompts were the most effective, boosting recall by up to 15%.
arXiv Detail & Related papers (2025-07-20T23:09:04Z) - Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction [0.0]
Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP). We present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems.
arXiv Detail & Related papers (2025-04-30T07:10:10Z) - Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system. It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z) - SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z) - Comparative Study of Domain Driven Terms Extraction Using Large Language Models [0.0]
Keywords play a crucial role in bridging the gap between human understanding and machine processing of textual data.
This review focuses on keyword extraction methods, emphasizing the use of three major Large Language Models (LLMs): Llama2-7B, GPT-3.5, and Falcon-7B.
arXiv Detail & Related papers (2024-04-02T22:04:51Z) - Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z) - ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [51.87391234815163]
E-commerce platforms require structured product data in the form of attribute-value pairs. BERT-based extraction methods require large amounts of task-specific training data. This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z) - Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment [52.592199835286394]
Open Information Extraction (OIE) aims to extract objective structured knowledge from natural texts.
Large language models (LLMs) have exhibited remarkable in-context learning capabilities.
arXiv Detail & Related papers (2023-10-16T17:11:42Z) - A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs). Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)
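The EVAPORATE entry above contrasts sending every document to an LLM with synthesizing an extraction function once and running it cheaply over the whole corpus. A minimal sketch of that trade-off, where a hand-written regex function stands in for LLM-synthesized code (the field name and corpus are invented for illustration, not taken from EVAPORATE):

```python
import re

def synthesized_extractor(doc: str) -> dict:
    """Stand-in for an extraction function the LLM would synthesize
    from a few sample documents; applying it costs no LLM calls."""
    m = re.search(r"Temperature:\s*(\d+)", doc)
    return {"temperature": m.group(1) if m else None}

corpus = [
    "Temperature: 170\nTime: 48h",
    "Time: 24h",
    "Temperature: 90",
]
results = [synthesized_extractor(d) for d in corpus]
print(results)
# [{'temperature': '170'}, {'temperature': None}, {'temperature': '90'}]
```

The quality gap the abstract notes shows up here as the second document, which the rigid synthesized function misses but direct per-document LLM extraction could still handle.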
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.