A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery
- URL: http://arxiv.org/abs/2406.10833v1
- Date: Sun, 16 Jun 2024 08:03:24 GMT
- Title: A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery
- Authors: Yu Zhang, Xiusi Chen, Bowen Jin, Sheng Wang, Shuiwang Ji, Wei Wang, Jiawei Han,
- Abstract summary: We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
We comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality.
- Score: 68.48094108571432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many scientific fields, large language models (LLMs) have revolutionized the way with which text and other modalities of data (e.g., molecules and proteins) are dealt, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one to two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.
Related papers
- MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension [59.41495657570397]
We collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals.
This dataset spans 72 scientific disciplines, ensuring both diversity and quality.
We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of ScientificAspects.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z) - Uni-SMART: Universal Science Multimodal Analysis and Research Transformer [22.90687836544612]
We present bfUni-text, an innovative model designed for in-depth understanding of scientific literature.
Uni-text demonstrates superior performance over other text-focused LLMs.
Our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts.
arXiv Detail & Related papers (2024-03-15T13:43:47Z) - A Survey of Pre-trained Language Models for Processing Scientific Text [26.986805626077892]
The number of Language Models (LMs) dedicated to processing scientific text is on the rise.
This work provides a comprehensive review of SciLMs, including an analysis of their effectiveness across different domains, tasks and datasets.
arXiv Detail & Related papers (2024-01-31T13:35:07Z) - Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z) - An Interdisciplinary Outlook on Large Language Models for Scientific
Research [3.4108358650013573]
We describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision.
We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications.
We articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use.
arXiv Detail & Related papers (2023-11-03T19:41:09Z) - A Bibliometric Review of Large Language Models Research from 2017 to
2023 [1.4190701053683017]
Large language models (LLMs) are language models that have demonstrated outstanding performance across a range of natural language processing (NLP) tasks.
This paper serves as a roadmap for researchers, practitioners, and policymakers to navigate the current landscape of LLMs research.
arXiv Detail & Related papers (2023-04-03T21:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.