Language agents achieve superhuman synthesis of scientific knowledge
- URL: http://arxiv.org/abs/2409.13740v2
- Date: Thu, 26 Sep 2024 15:27:08 GMT
- Title: Language agents achieve superhuman synthesis of scientific knowledge
- Authors: Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, Andrew D. White,
- Abstract summary: PaperQA2 is a frontier language model agent optimized for improved factuality, matches or exceeds subject matter expert performance.
PaperQA2 writes cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than existing, human-written Wikipedia articles.
We apply PaperQA2 to identify contradictions within the scientific literature, an important scientific task that is challenging for humans.
- Score: 0.7635132958167216
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language models are known to hallucinate incorrect information, and it is unclear if they are sufficiently accurate and reliable for use in scientific research. We developed a rigorous human-AI comparison methodology to evaluate language model agents on real-world literature search tasks covering information retrieval, summarization, and contradiction detection tasks. We show that PaperQA2, a frontier language model agent optimized for improved factuality, matches or exceeds subject matter expert performance on three realistic literature research tasks without any restrictions on humans (i.e., full access to internet, search tools, and time). PaperQA2 writes cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than existing, human-written Wikipedia articles. We also introduce a hard benchmark for scientific literature research called LitQA2 that guided design of PaperQA2, leading to it exceeding human performance. Finally, we apply PaperQA2 to identify contradictions within the scientific literature, an important scientific task that is challenging for humans. PaperQA2 identifies 2.34 +/- 1.99 contradictions per paper in a random subset of biology papers, of which 70% are validated by human experts. These results demonstrate that language model agents are now capable of exceeding domain experts across meaningful tasks on scientific literature.
Related papers
- Detecting Reference Errors in Scientific Literature with Large Language Models [0.552480439325792]
This work evaluated the ability of large language models in OpenAI's GPT family to detect quotation errors.
Results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning.
arXiv Detail & Related papers (2024-11-09T07:30:38Z) - SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers [20.273439120429025]
SciDQA is a new dataset for reading comprehension that challenges LLMs for a deep understanding of scientific articles.
Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors.
Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials.
arXiv Detail & Related papers (2024-11-08T05:28:22Z) - BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science [43.624608816218505]
BioKGBench is an evaluation benchmark for AI-driven biomedical agents.
We first disentangle "Understanding Literature" into two atomic abilities.
We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation.
We collect over two thousand data for two atomic tasks and 225 high-quality annotated data for the agent task.
arXiv Detail & Related papers (2024-06-29T15:23:28Z) - Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z) - ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is a large language model-powered research idea writing agent.
It generates problems, methods, and experiment designs while iteratively refining them based on scientific literature.
We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z) - PaperQA: Retrieval-Augmented Generative Agent for Scientific Research [41.9628176602676]
We present PaperQA, a RAG agent for answering questions over the scientific literature.
PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers.
We also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature.
arXiv Detail & Related papers (2023-12-08T18:50:20Z) - Is This Abstract Generated by AI? A Research for the Gap between
AI-generated Scientific Text and Human-written Scientific Text [13.438933219811188]
We investigate the gap between scientific content generated by AI and written by humans.
We find that there exists a writing style'' gap between AI-generated scientific text and human-written scientific text.
arXiv Detail & Related papers (2023-01-24T04:23:20Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z) - Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.