On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts
- URL: http://arxiv.org/abs/2502.15745v1
- Date: Sat, 08 Feb 2025 20:37:21 GMT
- Title: On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts
- Authors: Gautam Kishore Shahi, Oliver Hummel
- Abstract summary: We evaluate Large Language Models (LLMs) in their ability to sort scientific publications into hierarchical classification systems. Using the FORC dataset as ground truth data, we have found that recent LLMs are able to reach an accuracy of up to 0.82, which is up to 0.08 better than traditional BERT models.
- Score: 5.831737970661138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has led to a multitude of application opportunities. One traditional task for Information Retrieval systems is the summarization and classification of texts, both of which are important for supporting humans in navigating large bodies of literature, such as those that exist for scientific publications. Due to this rapidly growing body of scientific knowledge, recent research has been aiming at building research information systems that not only offer traditional keyword search capabilities, but also novel features such as the automatic detection of research areas that are present at knowledge-intensive organizations in academia and industry. To facilitate this idea, we present the results obtained from evaluating a variety of LLMs in their ability to sort scientific publications into hierarchical classification systems. Using the FORC dataset as ground truth data, we have found that recent LLMs (such as Meta Llama 3.1) are able to reach an accuracy of up to 0.82, which is up to 0.08 better than traditional BERT models.
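The abstract describes prompting LLMs to place each publication into a hierarchical classification system. The exact prompts, taxonomy labels, and model calls are not given in this listing, so everything below (the two-level taxonomy, the prompt wording, and the keyword matcher standing in for a real model API) is an illustrative assumption, not the authors' method:

```python
# Minimal sketch of two-step zero-shot hierarchical classification with an LLM.
# The taxonomy and prompt wording are illustrative assumptions; the paper uses
# the FORC taxonomy and models such as Meta Llama 3.1.

TAXONOMY = {
    "Computer Science": ["Information Retrieval", "Machine Learning"],
    "Physics": ["Optics", "Astrophysics"],
}

def build_prompt(abstract: str, fields: list) -> str:
    """Format a single-choice classification prompt for an LLM."""
    options = "\n".join(f"- {f}" for f in fields)
    return (
        "Classify the following abstract into exactly one field.\n"
        f"Fields:\n{options}\n\nAbstract: {abstract}\nField:"
    )

def classify(abstract: str, llm) -> tuple:
    """Hierarchical classification: pick a top-level field, then a subfield."""
    top = llm(build_prompt(abstract, list(TAXONOMY)))
    sub = llm(build_prompt(abstract, TAXONOMY[top]))
    return top, sub

def keyword_llm(prompt: str) -> str:
    """Trivial stand-in 'LLM' for demonstration: returns the first listed
    option that appears in the abstract (a real system would call a model)."""
    options = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    abstract = prompt.split("Abstract:")[1]
    for opt in options:
        if opt.lower() in abstract.lower():
            return opt
    return options[0]

print(classify("A study of machine learning for search.", keyword_llm))
# → ('Computer Science', 'Machine Learning')
```

Classifying level by level keeps each prompt's option list short, which matters for deep taxonomies where flattening all leaf categories into one prompt would be unwieldy.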
Related papers
- Science Hierarchography: Hierarchical Organization of Science Literature [20.182213614072836]
We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure.
We develop a range of algorithms to achieve the goals of SCIENCE HIERARCHOGRAPHY.
Results show that this structured approach enhances interpretability, supports trend discovery, and offers an alternative pathway for exploring scientific literature beyond traditional search methods.
arXiv Detail & Related papers (2025-04-18T17:59:29Z) - NANOGPT: A Query-Driven Large Language Model Retrieval-Augmented Generation System for Nanotechnology Research [7.520798704421448]
NANOGPT is a Large Language Model Retrieval-Augmented Generation (LLM-RAG) system tailored for nanotechnology research.
The system retrieves relevant literature using Google Scholar's advanced search and by scraping open-access papers from Elsevier, Springer Nature, and ACS Publications.
arXiv Detail & Related papers (2025-02-27T21:40:22Z) - SciPIP: An LLM-based Scientific Paper Idea Proposer [30.670219064905677]
We introduce SciPIP, an innovative framework designed to enhance the proposal of scientific ideas through improvements in both literature retrieval and idea generation.
Our experiments, conducted across various domains such as natural language processing and computer vision, demonstrate SciPIP's capability to generate a multitude of innovative and useful ideas.
arXiv Detail & Related papers (2024-10-30T16:18:22Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Towards Efficient Large Language Models for Scientific Text: A Review [4.376712802685017]
Large language models (LLMs) have ushered in a new era for processing complex information in various fields, including science.
Despite their power, LLMs require extremely expensive computational resources, vast amounts of data, and long training times.
In recent years, researchers have proposed various methodologies to make scientific LLMs more affordable.
arXiv Detail & Related papers (2024-08-20T10:57:34Z) - A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work.
ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them.
We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z) - Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z) - Using Large Language Models to Automate Category and Trend Analysis of Scientific Articles: An Application in Ophthalmology [4.455826633717872]
We present an automated method for article classification, leveraging the power of Large Language Models (LLMs).
The model achieved mean accuracy of 0.86 and mean F1 of 0.85 based on the RenD dataset.
The extensibility of the model to other scientific fields broadens its impact in facilitating research and trend analysis across diverse disciplines.
arXiv Detail & Related papers (2023-08-31T12:45:53Z) - MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z)
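Several entries above report classification quality as accuracy and (macro-averaged) F1, e.g. mean accuracy 0.86 and mean F1 0.85 on the RenD dataset, or accuracy 0.82 on FORC. These metrics follow their standard definitions; the labels below are made-up examples for illustration, not data from any of the papers:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical gold labels and predictions (not from any paper above).
y_true = ["IR", "ML", "ML", "IR"]
y_pred = ["IR", "ML", "IR", "IR"]
print(accuracy(y_true, y_pred))           # 0.75
print(round(macro_f1(y_true, y_pred), 3)) # 0.733
```

Macro F1 weights every class equally, so it penalizes a classifier that ignores rare categories even when overall accuracy looks high, which is why papers on imbalanced taxonomies tend to report both numbers.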
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.