Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI
- URL: http://arxiv.org/abs/2505.05864v1
- Date: Fri, 09 May 2025 07:58:30 GMT
- Authors: Junhyeong Lee, Jong Min Yuk, Chan-Woo Lee
- Abstract summary: We propose a novel hybrid text-mining framework to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and subsequently into structured form. We also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations.
- Score: 4.178382980763478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent advances in natural language processing (NLP) have facilitated automatic extraction of structured data from unstructured scientific literature. While existing approaches (multi-step and direct methods) offer valuable capabilities, they also come with limitations when applied independently. Here, we propose a novel hybrid text-mining framework that integrates the advantages of both methods to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and subsequently into structured form. Furthermore, beyond the overall data structuring framework, we also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations to highlight target entities. Specifically, our entity marker-based hybrid approach not only consistently outperforms previous entity recognition approaches across three benchmark datasets (MatScholar, SOFC, and SOFC slot NER) but also improves the quality of the final structured data, yielding up to a 58% improvement in entity-level F1 score and up to an 83% improvement in relation-level F1 score compared to the direct approach.
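The two-stage idea in the abstract (raw text → entity-marked text → structured records) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `@...@` marker symbols, the `MAT`/`TEMP` entity labels, and both helper functions are assumptions made for the example; the paper's actual marker conventions may differ.

```python
import re

def mark_entities(text, entities):
    """Stage 1: wrap each target entity in symbolic markers tagged with
    its entity type, so the entity is visually highlighted for a
    generative model."""
    marked = text
    for surface, label in entities:
        marked = marked.replace(surface, f"@{surface}@[{label}]")
    return marked

def extract_structured(marked_text):
    """Stage 2: convert entity-marked text into structured records by
    parsing the symbolic annotations."""
    pattern = r"@([^@]+)@\[([A-Z]+)\]"
    return [{"entity": m.group(1), "type": m.group(2)}
            for m in re.finditer(pattern, marked_text)]

sentence = "LiFePO4 cathodes were sintered at 700 C."
entities = [("LiFePO4", "MAT"), ("700 C", "TEMP")]

marked = mark_entities(sentence, entities)
# "@LiFePO4@[MAT] cathodes were sintered at @700 C@[TEMP]."
records = extract_structured(marked)
```

In the paper the marking itself is done by a trained entity-recognition step rather than a lookup table; the sketch only shows why symbolic markers make the second structuring stage a simple, reliable parse.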
Related papers
- Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions [15.97570754056266]
We propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs). Our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents.
arXiv Detail & Related papers (2026-02-20T00:12:04Z) - SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP [3.806421007129287]
SciNLP is a benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations.
arXiv Detail & Related papers (2025-09-09T14:41:40Z) - Enhancing Abstractive Summarization of Scientific Papers Using Structure Information [6.414732533433283]
We propose a two-stage abstractive summarization framework that leverages automatic recognition of structural functions within scientific papers. In the first stage, we standardize chapter titles from numerous scientific papers and construct a large-scale dataset for structural function recognition. In the second stage, we employ Longformer to capture rich contextual relationships across sections and generate context-aware summaries.
arXiv Detail & Related papers (2025-05-20T10:34:45Z) - Integrating Textual Embeddings from Contrastive Learning with Generative Recommender for Enhanced Personalization [8.466223794246261]
We propose a hybrid framework that augments the generative recommender with a contrastive text-embedding model. We evaluate our method on two domains from the Amazon Reviews 2023 dataset.
arXiv Detail & Related papers (2025-04-13T15:23:00Z) - ORIGAMI: A generative transformer architecture for predictions from semi-structured data [3.5639148953570836]
ORIGAMI is a transformer-based architecture that processes nested key/value pairs. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks.
arXiv Detail & Related papers (2024-12-23T07:21:17Z) - Value Alignment from Unstructured Text [32.9140028463247]
We introduce a systematic end-to-end methodology for aligning large language models (LLMs) to the implicit and explicit values represented in unstructured text data.
Our proposed approach leverages the use of scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data.
Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches.
arXiv Detail & Related papers (2024-08-19T20:22:08Z) - Accelerated materials language processing enabled by GPT [5.518792725397679]
We develop generative pre-trained transformer (GPT)-enabled pipelines for materials language processing.
First, we develop a GPT-enabled document classification method for screening relevant documents.
Second, for the NER task, we design entity-centric prompts; few-shot learning with these prompts improves performance.
Finally, we develop a GPT-enabled extractive QA model, which provides improved performance and shows the possibility of automatically correcting annotations.
arXiv Detail & Related papers (2023-08-18T07:31:13Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - Nested Named Entity Recognition as Holistic Structure Parsing [92.8397338250383]
This work models the full nested NEs in a sentence as a holistic structure, then we propose a holistic structure parsing algorithm to disclose the entire NEs once for all.
Experiments show that our model yields promising results on widely-used benchmarks, approaching or even achieving the state of the art.
arXiv Detail & Related papers (2022-04-17T12:48:20Z) - Modeling Multi-Granularity Hierarchical Features for Relation Extraction [26.852869800344813]
We propose a novel method to extract multi-granularity features based solely on the original input sentences.
We show that effective structured features can be attained even without external knowledge.
arXiv Detail & Related papers (2022-04-09T09:44:05Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.