Related papers: Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo

Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo

URL: http://arxiv.org/abs/2405.02128v1
Date: Fri, 3 May 2024 14:29:54 GMT
Title: Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo
Authors: Nakul Rampal, Kaiyu Wang, Matthew Burigana, Lingxiang Hou, Juri Al-Johani, Anna Sackmann, Hanan S. Murayshid, Walaa Abdullah Al-Sumari, Arwa M. Al-Abdulkarim, Nahla Eid Al-Hazmi, Majed O. Al-Awad, Christian Borgs, Jennifer T. Chayes, Omar M. Yaghi,
Abstract summary: 'RetChemQA' is a benchmark dataset designed to evaluate the capabilities of machine learning models in the domain of reticular chemistry. This dataset includes both single-hop and multi-hop question-answer pairs, encompassing approximately 45,000 Q&As for each type. The questions have been extracted from an extensive corpus of literature containing about 2,530 research papers from publishers including NAS, ACS, RSC, Elsevier, and Nature Publishing Group.
Score: 0.5110571587151475
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement in artificial intelligence and natural language processing has led to the development of large-scale datasets aimed at benchmarking the performance of machine learning models. Herein, we introduce 'RetChemQA,' a comprehensive benchmark dataset designed to evaluate the capabilities of such models in the domain of reticular chemistry. This dataset includes both single-hop and multi-hop question-answer pairs, encompassing approximately 45,000 Q&As for each type. The questions have been extracted from an extensive corpus of literature containing about 2,530 research papers from publishers including NAS, ACS, RSC, Elsevier, and Nature Publishing Group, among others. The dataset has been generated using OpenAI's GPT-4 Turbo, a cutting-edge model known for its exceptional language understanding and generation capabilities. In addition to the Q&A dataset, we also release a dataset of synthesis conditions extracted from the corpus of literature used in this study. The aim of RetChemQA is to provide a robust platform for the development and evaluation of advanced machine learning algorithms, particularly for the reticular chemistry community. The dataset is structured to reflect the complexities and nuances of real-world scientific discourse, thereby enabling nuanced performance assessments across a variety of tasks. The dataset is available at the following link: https://github.com/nakulrampal/RetChemQA

Related papers

ChemRxivQuest: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Preprints [0.0]
ChemRxivQuest is a curated dataset of 970 high-quality question-answer (QA) pairs from 155 ChemRxiv preprints across 17 subfields of chemistry.<n>Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy.<n>ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification.
arXiv Detail & Related papers (2025-05-08T13:26:33Z)
Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models [0.0]
This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems. It focuses on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain.
arXiv Detail & Related papers (2024-10-24T00:49:46Z)
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets. We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth. Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into readily understandable format. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful. We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations. We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models. The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS) The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
Image and Data Mining in Reticular Chemistry Using GPT-4V [5.440238820637818]
GPT-4V is a large language model featuring enhanced vision capabilities, accessible through ChatGPT or an API. This study demonstrates the remarkable ability of GPT-4V to navigate and obtain complex data for metal-organic frameworks.
arXiv Detail & Related papers (2023-12-09T05:05:25Z)
Exploring the Potential of AI-Generated Synthetic Datasets: A Case Study on Telematics Data with ChatGPT [0.0]
This research delves into the construction and utilization of synthetic datasets, specifically within the telematics sphere, leveraging OpenAI's powerful language model, ChatGPT. To illustrate this data creation process, a hands-on case study is conducted, focusing on the generation of a synthetic telematics dataset.
arXiv Detail & Related papers (2023-06-23T15:15:13Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
QDataset: Quantum Datasets for Machine Learning [1.160208922584163]
The QDataSet is a quantum dataset designed specifically to facilitate the training and development of QML algorithms. The datasets are structured to provide a wealth of information to enable machine learning practitioners to use the QDataSet to solve problems in applied quantum computation. Accompanying the datasets on the associated GitHub repository are a set of demonstrating the use of the QDataSet in a range of optimisation contexts.
arXiv Detail & Related papers (2021-08-15T05:30:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.