PQA: Zero-shot Protein Question Answering for Free-form Scientific
Enquiry with Large Language Models
- URL: http://arxiv.org/abs/2402.13653v1
- Date: Wed, 21 Feb 2024 09:38:17 GMT
- Authors: Eli M Carrami and Sahand Sharifzadeh
- Abstract summary: We introduce the novel task of zero-shot Protein Question Answering (PQA) for free-form scientific enquiry.
Given a previously unseen protein sequence and a natural language question, the task is to deliver a scientifically accurate answer.
We contribute the first specialized dataset for PQA model training, containing 257K protein sequences annotated with 1.97M scientific question-answer pairs.
- Score: 5.062600294117055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the novel task of zero-shot Protein Question Answering (PQA) for
free-form scientific enquiry. Given a previously unseen protein sequence and a
natural language question, the task is to deliver a scientifically accurate
answer. This task not only supports future biological research, but could also
provide a test bed for assessing the scientific precision of large language
models (LLMs). We contribute the first specialized dataset for PQA model
training, containing 257K protein sequences annotated with 1.97M scientific
question-answer pairs. Additionally, we propose and study several novel
biologically relevant benchmarks for scientific PQA. Employing two robust
multi-modal architectures, we establish an initial state-of-the-art performance
for PQA and reveal key performance factors through ablation studies. Our
comprehensive PQA framework, named Pika, which includes the dataset, code, model
checkpoints, and a user-friendly demo, is openly accessible at
github.com/EMCarrami/Pika, promoting wider research and application in the
field.
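The PQA task described above takes a previously unseen protein sequence plus a free-form question and must return an answer. A minimal sketch of that interface follows; the function names and prompt format are illustrative assumptions, not the actual Pika API.

```python
def build_pqa_prompt(sequence: str, question: str) -> str:
    """Combine a protein sequence and a free-form question into one prompt.

    The layout below is a hypothetical convention, not Pika's real format.
    """
    return f"Protein sequence: {sequence}\nQuestion: {question}\nAnswer:"

def answer_question(model, sequence: str, question: str) -> str:
    """Query any text-generation callable with the combined prompt."""
    return model(build_pqa_prompt(sequence, question))

# Trivial stand-in "model" that echoes the question line it received;
# a real system would call a multimodal protein-language model here.
def dummy_model(prompt: str) -> str:
    return "[model answer for] " + prompt.splitlines()[1]

print(answer_question(dummy_model, "MKTAYIAKQR", "What is the molecular weight?"))
```

Because the interface is just "sequence + question in, text out", any LLM backend can be swapped in behind the `model` callable for zero-shot evaluation.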
Related papers
- PeerQA: A Scientific Question Answering Dataset from Peer Reviews [51.95579001315713]
We present PeerQA, a real-world, scientific, document-level Question Answering dataset.
The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP.
We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks.
arXiv Detail & Related papers (2025-02-19T12:24:46Z)
- Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure [7.9473027178525975]
Prot2Chat is a novel framework that integrates multimodal protein representations with natural language through a unified module.
Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, and a protein-text adapter with cross-attention mechanisms.
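The protein-text adapter with cross-attention described above can be illustrated with a single-head cross-attention step in which text-token queries attend to protein-residue keys and values. This is a generic sketch of the mechanism, not Prot2Chat's actual implementation, and the shapes and initialization are assumptions.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head cross-attention: text tokens (queries) attend to
    protein tokens (keys/values). Shapes: (n_q, d), (n_k, d), (n_k, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (n_q, n_k) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over protein tokens
    return weights @ values                          # (n_q, d) fused features

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))      # 4 text-token embeddings (hypothetical dims)
protein = rng.normal(size=(10, 8))  # 10 residue embeddings from a structure encoder
out = cross_attention(text, protein, protein)
print(out.shape)
```

In a real adapter the queries, keys, and values would first pass through learned projection matrices, and the output would be fed into the language model's hidden states.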
arXiv Detail & Related papers (2025-02-07T05:23:16Z)
- Open-Source Protein Language Models for Function Prediction and Protein Design [0.0]
Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering.
We integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks.
We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks.
arXiv Detail & Related papers (2024-12-18T05:41:15Z)
- Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting [0.0]
HOPER (HOlistic ProtEin Representation) is a novel framework designed to enhance protein function prediction (PFP) in low-data settings.
Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research.
arXiv Detail & Related papers (2024-11-22T20:13:55Z)
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format.
The ScholarChemQA dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of potentially useful unlabeled data.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- Reinforcement Learning for Sequence Design Leveraging Protein Language Models [14.477268882311991]
We propose to use protein language models (PLMs) as a reward function to generate new sequences.
We perform extensive experiments on various sequence lengths to benchmark RL-based approaches.
We provide comprehensive evaluations of the biological plausibility and diversity of the generated proteins.
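Using a PLM as a reward function means scoring each candidate sequence and optimizing toward higher scores. The sketch below substitutes simple greedy hill-climbing for the paper's RL approach and a toy composition score for a real PLM likelihood; both substitutions are assumptions made so the example stays self-contained.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def plm_reward(seq: str) -> float:
    # Stand-in for a protein language model's (pseudo-)log-likelihood;
    # here we simply reward alanine content for illustration.
    return seq.count("A") / len(seq)

def mutate(seq: str, rng: random.Random) -> str:
    # Point-mutate one randomly chosen position.
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]

def hill_climb(seq: str, steps: int = 200, seed: int = 0) -> str:
    # Greedy search: keep a mutation only if the reward improves.
    rng = random.Random(seed)
    best, best_r = seq, plm_reward(seq)
    for _ in range(steps):
        cand = mutate(best, rng)
        r = plm_reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best

designed = hill_climb("M" * 30)
print(plm_reward(designed))
```

An RL formulation would replace the greedy acceptance rule with a learned policy updated from the reward signal, which allows trading off reward against sequence diversity.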
arXiv Detail & Related papers (2024-07-03T14:31:36Z) - ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z) - Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and
Textual Content in Finance [71.76018597965378]
We build a new large-scale Question Answering dataset containing both Tabular And Textual data, named TAT-QA.
We propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text.
arXiv Detail & Related papers (2021-05-17T06:12:06Z) - Template-Based Question Generation from Retrieved Sentences for Improved
Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.