PQA: Zero-shot Protein Question Answering for Free-form Scientific
Enquiry with Large Language Models
- URL: http://arxiv.org/abs/2402.13653v1
- Date: Wed, 21 Feb 2024 09:38:17 GMT
- Authors: Eli M Carrami and Sahand Sharifzadeh
- Abstract summary: We introduce the novel task of zero-shot Protein Question Answering (PQA) for free-form scientific enquiry.
Given a previously unseen protein sequence and a natural language question, the task is to deliver a scientifically accurate answer.
We contribute the first specialized dataset for PQA model training, containing 257K protein sequences annotated with 1.97M scientific question-answer pairs.
- Score: 5.062600294117055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the novel task of zero-shot Protein Question Answering (PQA) for
free-form scientific enquiry. Given a previously unseen protein sequence and a
natural language question, the task is to deliver a scientifically accurate
answer. This task not only supports future biological research, but could also
provide a test bed for assessing the scientific precision of large language
models (LLMs). We contribute the first specialized dataset for PQA model
training, containing 257K protein sequences annotated with 1.97M scientific
question-answer pairs. Additionally, we propose and study several novel
biologically relevant benchmarks for scientific PQA. Employing two robust
multi-modal architectures, we establish an initial state-of-the-art performance
for PQA and reveal key performance factors through ablation studies. Our
comprehensive PQA framework, named Pika, which includes the dataset, code, model
checkpoints, and a user-friendly demo, is openly accessible at
github.com/EMCarrami/Pika, promoting wider research and application in the
field.
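The PQA task described above takes a previously unseen protein sequence plus a free-form question and must return an answer. A minimal sketch of that interface follows; the function names and prompt format are illustrative assumptions, not the actual Pika API.

```python
def build_pqa_prompt(sequence: str, question: str) -> str:
    """Combine a protein sequence and a free-form question into one prompt.

    The layout below is a hypothetical convention, not Pika's real format.
    """
    return f"Protein sequence: {sequence}\nQuestion: {question}\nAnswer:"

def answer_question(model, sequence: str, question: str) -> str:
    """Query any text-generation callable with the combined prompt."""
    return model(build_pqa_prompt(sequence, question))

# Trivial stand-in "model" that echoes the question line it received;
# a real system would call a multimodal protein-language model here.
def dummy_model(prompt: str) -> str:
    return "[model answer for] " + prompt.splitlines()[1]

print(answer_question(dummy_model, "MKTAYIAKQR", "What is the molecular weight?"))
```

Because the interface is just "sequence + question in, text out", any LLM backend can be swapped in behind the `model` callable for zero-shot evaluation.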
Related papers
- PeerQA: A Scientific Question Answering Dataset from Peer Reviews [51.95579001315713]
We present PeerQA, a real-world, scientific, document-level Question Answering dataset.
The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP.
We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks.
arXiv Detail & Related papers (2025-02-19T12:24:46Z)
- Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure [7.9473027178525975]
Prot2Chat is a novel framework that integrates multimodal protein representations with natural language through a unified module.
Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, and a protein-text adapter with cross-attention mechanisms.
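The protein-text adapter with cross-attention described above can be illustrated with a single-head cross-attention step in which text-token queries attend to protein-residue keys and values. This is a generic sketch of the mechanism, not Prot2Chat's actual implementation, and the shapes and initialization are assumptions.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head cross-attention: text tokens (queries) attend to
    protein tokens (keys/values). Shapes: (n_q, d), (n_k, d), (n_k, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (n_q, n_k) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over protein tokens
    return weights @ values                          # (n_q, d) fused features

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))      # 4 text-token embeddings (hypothetical dims)
protein = rng.normal(size=(10, 8))  # 10 residue embeddings from a structure encoder
out = cross_attention(text, protein, protein)
print(out.shape)
```

In a real adapter the queries, keys, and values would first pass through learned projection matrices, and the output would be fed into the language model's hidden states.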
arXiv Detail & Related papers (2025-02-07T05:23:16Z)
- Open-Source Protein Language Models for Function Prediction and Protein Design [0.0]
Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering.
We integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks.
We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks.
arXiv Detail & Related papers (2024-12-18T05:41:15Z)
- Multi-modal Representation Learning Enables Accurate Protein Function Prediction in Low-Data Setting [0.0]
HOPER (HOlistic ProtEin Representation) is a novel framework designed to enhance protein function prediction (PFP) in low-data settings.
Our results highlight the effectiveness of multimodal representation learning for overcoming data limitations in biological research.
arXiv Detail & Related papers (2024-11-22T20:13:55Z)
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format.
The ScholarChemQA dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of potentially useful unlabeled data.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- Reinforcement Learning for Sequence Design Leveraging Protein Language Models [14.477268882311991]
We propose to use protein language models (PLMs) as a reward function to generate new sequences.
We perform extensive experiments on various sequence lengths to benchmark RL-based approaches.
We provide comprehensive evaluations of the biological plausibility and diversity of the generated proteins.
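Using a PLM as a reward function means scoring each candidate sequence and optimizing toward higher scores. The sketch below substitutes simple greedy hill-climbing for the paper's RL approach and a toy composition score for a real PLM likelihood; both substitutions are assumptions made so the example stays self-contained.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def plm_reward(seq: str) -> float:
    # Stand-in for a protein language model's (pseudo-)log-likelihood;
    # here we simply reward alanine content for illustration.
    return seq.count("A") / len(seq)

def mutate(seq: str, rng: random.Random) -> str:
    # Point-mutate one randomly chosen position.
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]

def hill_climb(seq: str, steps: int = 200, seed: int = 0) -> str:
    # Greedy search: keep a mutation only if the reward improves.
    rng = random.Random(seed)
    best, best_r = seq, plm_reward(seq)
    for _ in range(steps):
        cand = mutate(best, rng)
        r = plm_reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best

designed = hill_climb("M" * 30)
print(plm_reward(designed))
```

An RL formulation would replace the greedy acceptance rule with a learned policy updated from the reward signal, which allows trading off reward against sequence diversity.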
arXiv Detail & Related papers (2024-07-03T14:31:36Z) - ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z) - Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z) - Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z) - TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and
Textual Content in Finance [71.76018597965378]
We build a new large-scale Question Answering dataset containing both Tabular And Textual data, named TAT-QA.
We propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text.
arXiv Detail & Related papers (2021-05-17T06:12:06Z) - Template-Based Question Generation from Retrieved Sentences for Improved
Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.