Extracting Polymer Nanocomposite Samples from Full-Length Documents
- URL: http://arxiv.org/abs/2403.00260v1
- Date: Fri, 1 Mar 2024 03:51:56 GMT
- Title: Extracting Polymer Nanocomposite Samples from Full-Length Documents
- Authors: Ghazal Khalighinejad, Defne Circi, L.C. Brinson, Bhuwan Dhingra
- Abstract summary: This paper investigates the use of large language models (LLMs) for extracting sample lists of polymer nanocomposites (PNCs) from full-length materials science research papers.
The challenge lies in the complex nature of PNC samples, which have numerous attributes scattered throughout the text.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the use of large language models (LLMs) for
extracting sample lists of polymer nanocomposites (PNCs) from full-length
materials science research papers. The challenge lies in the complex nature of
PNC samples, which have numerous attributes scattered throughout the text. The
complexity of annotating detailed information on PNCs limits the availability
of data, making conventional document-level relation extraction techniques
impractical due to the challenge in creating comprehensive named entity span
annotations. To address this, we introduce a new benchmark and an evaluation
technique for this task and explore different prompting strategies in a
zero-shot manner. We also incorporate self-consistency to improve the
performance. Our findings show that even advanced LLMs struggle to extract all
of the samples from an article. Finally, we analyze the errors encountered in
this process, categorizing them into three main challenges, and discuss
potential strategies for future research to overcome them.
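The self-consistency step mentioned in the abstract can be illustrated with a minimal sketch: sample several zero-shot completions for the same paper, then keep only the extracted samples that a majority of runs agree on. The representation of a PNC sample as a (matrix, filler, composition) tuple and the `min_votes` threshold are illustrative assumptions, not the paper's actual schema.

```python
from collections import Counter

def self_consistent_extraction(candidate_lists, min_votes=2):
    """Majority-vote over several zero-shot extraction runs.

    candidate_lists: one list of (matrix, filler, composition) tuples per
    sampled LLM completion. A sample is kept only if it appears in at
    least `min_votes` of the runs.
    """
    votes = Counter()
    for run in candidate_lists:
        for sample in set(run):  # count each sample at most once per run
            votes[sample] += 1
    return [sample for sample, v in votes.items() if v >= min_votes]

# Three hypothetical completions for the same article.
run_a = [("epoxy", "SiO2", "2 wt%"), ("epoxy", "SiO2", "5 wt%")]
run_b = [("epoxy", "SiO2", "2 wt%"), ("epoxy", "TiO2", "1 wt%")]
run_c = [("epoxy", "SiO2", "2 wt%"), ("epoxy", "SiO2", "5 wt%")]

consensus = self_consistent_extraction([run_a, run_b, run_c])
# The SiO2 samples appear in >= 2 runs and survive; the TiO2 sample does not.
```

Voting over set-valued outputs (rather than a single answer string, as in the original self-consistency work) is one natural way to adapt the technique to sample-list extraction.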
Related papers
- Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization [25.052557735932535]
Large language models (LLMs) have demonstrated the potential to revolutionize diverse tasks within natural language processing.
This paper explores the potential of fine-tuning LLMs for the aspect-based summarization task.
We evaluate the impact of fine-tuning open-source foundation LLMs, including Llama2, Mistral, Gemma and Aya, on a publicly available domain-specific aspect based summary dataset.
arXiv Detail & Related papers (2024-08-05T16:00:21Z)
- Synthesizing Scientific Summaries: An Extractive and Abstractive Approach [0.5904095466127044]
We propose a hybrid methodology for research paper summarisation.
We use two unsupervised models for the extraction stage and two transformer language models for the abstractive stage.
We find that, with certain combinations of hyperparameters, automated summarisation systems can exceed the abstractiveness of summaries written by humans.
arXiv Detail & Related papers (2024-07-29T08:21:42Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications [10.529898520273063]
ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
arXiv Detail & Related papers (2024-03-08T13:32:01Z)
- Prompting LLMs with content plans to enhance the summarization of scientific articles [0.19183348587701113]
We conceive, implement, and evaluate prompting techniques to guide summarization systems.
We feed summarizers with lists of key terms extracted from articles.
Results show performance gains, especially for smaller models summarizing sections separately.
arXiv Detail & Related papers (2023-12-13T16:57:31Z)
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
- PcMSP: A Dataset for Scientific Action Graphs Extraction from Polycrystalline Materials Synthesis Procedure Text [1.9573380763700712]
This dataset simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
A two-step human annotation and inter-annotator agreement study guarantee the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
arXiv Detail & Related papers (2022-10-22T09:43:54Z)
- Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z)
- Abstractive Query Focused Summarization with Query-Free Resources [60.468323530248945]
In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system.
We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries.
Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting.
arXiv Detail & Related papers (2020-12-29T14:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.