PQA: Zero-shot Protein Question Answering for Free-form Scientific
Enquiry with Large Language Models
- URL: http://arxiv.org/abs/2402.13653v1
- Date: Wed, 21 Feb 2024 09:38:17 GMT
- Title: PQA: Zero-shot Protein Question Answering for Free-form Scientific
Enquiry with Large Language Models
- Authors: Eli M Carrami and Sahand Sharifzadeh
- Abstract summary: We introduce the novel task of zero-shot Protein Question Answering (PQA) for free-form scientific enquiry.
Given a previously unseen protein sequence and a natural language question, the task is to deliver a scientifically accurate answer.
We contribute the first specialized dataset for PQA model training, containing 257K protein sequences annotated with 1.97M scientific question-answer pairs.
- Score: 5.062600294117055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the novel task of zero-shot Protein Question Answering (PQA) for
free-form scientific enquiry. Given a previously unseen protein sequence and a
natural language question, the task is to deliver a scientifically accurate
answer. This task not only supports future biological research, but could also
provide a test bed for assessing the scientific precision of large language
models (LLMs). We contribute the first specialized dataset for PQA model
training, containing 257K protein sequences annotated with 1.97M scientific
question-answer pairs. Additionally, we propose and study several novel
biologically relevant benchmarks for scientific PQA. Employing two robust
multi-modal architectures, we establish an initial state-of-the-art performance
for PQA and reveal key performance factors through ablation studies. Our
comprehensive PQA framework, named Pika, including dataset, code, model
checkpoints, and a user-friendly demo, is openly accessible on
github.com/EMCarrami/Pika, promoting wider research and application in the
field.
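To make the task's input/output format concrete, here is a minimal sketch of a single PQA instance as defined above: a protein sequence paired with a free-form question, with a scientific answer as the target. The class and field names are illustrative assumptions, not the actual schema used in the Pika repository.
```python
# A minimal, illustrative model of one zero-shot PQA instance.
# Field names are assumptions; see github.com/EMCarrami/Pika for the
# actual dataset schema and model code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PQAInstance:
    sequence: str                  # amino-acid sequence of an unseen protein
    question: str                  # free-form scientific question
    answer: Optional[str] = None   # gold answer (training/evaluation only)

# One of the ~1.97M question-answer pairs might look like this
# (the sequence below is made up purely for illustration):
example = PQAInstance(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    question="What is the likely subcellular localization of this protein?",
    answer="The protein is predicted to localize to the cytoplasm.",
)
```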
Related papers
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format.
The ScholarChemQA dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that is potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation [11.129800893611646]
SciQAG is a framework for automatically generating high-quality science question-answer pairs from a large corpus of scientific literature based on large language models (LLMs).
We construct a large-scale, high-quality, open-ended science QA dataset containing 188,042 QA pairs extracted from 22,743 scientific papers across 24 scientific domains.
We also introduce SciQAG-24D, a new benchmark task designed to evaluate the science question-answering ability of LLMs.
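As a rough illustration of this kind of LLM-driven QA generation, the sketch below builds a generation prompt for one paper excerpt and parses the model's "Q:/A:" output. The prompt wording and the `call_llm` callable are assumptions for illustration, not SciQAG's actual implementation.
```python
# Hedged sketch of LLM-based QA-pair generation in the spirit of SciQAG.
# The prompt text and the `call_llm` interface are illustrative assumptions.
PROMPT_TEMPLATE = (
    "Read the following excerpt from a scientific paper and write {n} "
    "open-ended question-answer pairs answerable from the text alone.\n\n"
    "Excerpt:\n{text}\n\nFormat each pair as 'Q: ...' then 'A: ...'."
)

def generate_qa_pairs(paper_text, call_llm, n_pairs=5):
    """Prompt an LLM (any callable str -> str) and parse its output
    into (question, answer) tuples."""
    raw = call_llm(PROMPT_TEMPLATE.format(n=n_pairs, text=paper_text))
    pairs, question = [], None
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs
```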
arXiv Detail & Related papers (2024-05-16T09:42:37Z)
- Around the GLOBE: Numerical Aggregation Question-Answering on Heterogeneous Genealogical Knowledge Graphs with Deep Neural Networks [0.934612743192798]
We present a new end-to-end methodology for numerical aggregation QA for genealogical trees.
The proposed architecture, GLOBE, outperforms the state-of-the-art models and pipelines by achieving 87% accuracy for this task.
This study may have practical implications for genealogical information centers and museums.
arXiv Detail & Related papers (2023-07-30T12:09:00Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs is required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
Training data is further augmented with pseudo UQs obtained by randomly pairing images and questions; a rough sketch of this idea follows below.
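The sketch below illustrates the pseudo-UQ idea under the assumption that a randomly mismatched image-question pair is almost always unanswerable (this is noisy: an occasional random pair may still be answerable). The function is illustrative, not the RGQA pipeline itself.
```python
import random

def make_pseudo_uqs(images, questions, n, seed=0):
    """Create n pseudo unanswerable questions (UQs) by pairing a
    randomly drawn question with an unrelated, randomly drawn image.
    Mismatched pairs are unanswerable with high probability, giving a
    cheap (if noisy) training signal that needs no UQ annotation."""
    rng = random.Random(seed)
    return [
        {"image": rng.choice(images),
         "question": rng.choice(questions),
         "answerable": False}
        for _ in range(n)
    ]
```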
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- ProQA: Structural Prompt-based Pre-training for Unified Question Answering [84.59636806421204]
ProQA is a unified QA paradigm that solves various tasks through a single model.
It concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task.
ProQA consistently boosts performance across full-data fine-tuning, few-shot learning, and zero-shot testing scenarios.
arXiv Detail & Related papers (2022-05-09T04:59:26Z)
- Science Checker: Extractive-Boolean Question Answering For Scientific Fact Checking [0.0]
We propose a multi-task approach for verifying scientific questions based on joint reasoning over facts and evidence in research articles.
With our lightweight and fast architecture, we achieved an average error rate of 4% and an F1-score of 95.6%.
arXiv Detail & Related papers (2022-04-26T12:35:23Z)
- CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training [21.07506671340319]
We propose a novel question-answering dataset based on the Common Crawl project.
We extract around 130 million multilingual question-answer pairs, including about 60 million English data-points.
With this unprecedented number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question-answering.
arXiv Detail & Related papers (2021-10-14T21:23:01Z)
- TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance [71.76018597965378]
We build a new large-scale Question Answering dataset containing both Tabular And Textual data, named TAT-QA.
We propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text.
arXiv Detail & Related papers (2021-05-17T06:12:06Z)
- CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering [27.45623324582005]
Clinical question answering (QA) aims to automatically answer questions from medical professionals based on clinical texts.
We propose CliniQG4QA, which leverages question generation (QG) to synthesize QA pairs on new clinical contexts.
To generate the diverse question types essential for training QA models, we introduce a seq2seq-based question phrase prediction (QPP) module.
arXiv Detail & Related papers (2020-10-30T02:06:10Z)
- Understanding Unnatural Questions Improves Reasoning over Text [54.235828149899625]
Complex question answering (CQA) over raw text is a challenging task.
Learning an effective CQA model requires large amounts of human-annotated data.
We address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions.
arXiv Detail & Related papers (2020-10-19T10:22:16Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
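To illustrate the template idea in the entry just above, here is a toy cloze-style generator: it masks a chosen answer span in a retrieved sentence and turns the result into a question. The template and function name are assumptions for illustration, not the paper's exact templates.
```python
def template_question(sentence, answer):
    """Toy cloze template: blank out the answer span in a retrieved
    sentence and append a question mark. Returns (question, answer),
    or None if the span does not occur in the sentence."""
    if answer not in sentence:
        return None
    question = sentence.replace(answer, "___", 1).rstrip(".") + "?"
    return question, answer

# Example: pseudo-training data generated from a retrieved sentence.
print(template_question("Marie Curie discovered polonium in 1898.", "polonium"))
# -> ('Marie Curie discovered ___ in 1898?', 'polonium')
```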
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.