Related papers: QuantumLLMInstruct: A 500k LLM Instruction-Tuning Dataset with Problem-Solution Pairs for Quantum Computing

URL: http://arxiv.org/abs/2412.20956v1
Date: Mon, 30 Dec 2024 13:53:51 GMT
Title: QuantumLLMInstruct: A 500k LLM Instruction-Tuning Dataset with Problem-Solution Pairs for Quantum Computing
Authors: Shlomo Kashani
Abstract summary: We present QuantumLLMInstruct (QLMMI), the largest and most comprehensive dataset of its kind. QLMMI features over 500,000 meticulously curated instruction-following problem-solution pairs designed specifically for quantum computing.
Score: 1.90365714903665
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present QuantumLLMInstruct (QLMMI), an innovative dataset featuring over 500,000 meticulously curated instruction-following problem-solution pairs designed specifically for quantum computing - the largest and most comprehensive dataset of its kind. Originating from over 90 primary seed domains and encompassing hundreds of subdomains autonomously generated by LLMs, QLMMI marks a transformative step in the diversity and richness of quantum computing datasets. Designed for instruction fine-tuning, QLMMI seeks to significantly improve LLM performance in addressing complex quantum computing challenges across a wide range of quantum physics topics. While Large Language Models (LLMs) have propelled advancements in computational science with datasets like Omni-MATH and OpenMathInstruct, these primarily target Olympiad-level mathematics, leaving quantum computing largely unexplored. The creation of QLMMI follows a rigorous four-stage methodology. Initially, foundational problems are developed using predefined templates, focusing on critical areas such as synthetic Hamiltonians, QASM code generation, Jordan-Wigner transformations, and Trotter-Suzuki quantum circuit decompositions. Next, detailed and domain-specific solutions are crafted to ensure accuracy and relevance. In the third stage, the dataset is enriched through advanced reasoning techniques, including Chain-of-Thought (CoT) and Task-Oriented Reasoning and Action (ToRA), which enhance problem-solution diversity while adhering to strict mathematical standards. Lastly, a zero-shot Judge LLM performs self-assessments to validate the dataset's quality and reliability, minimizing human oversight requirements.
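As an illustration of the first stage described in the abstract (template-based problem generation over synthetic Hamiltonians), a minimal sketch might look like the following. The template text, the function names (`make_problem`, `random_pauli_term`, `judge`), and the record layout are all hypothetical and are not taken from the paper; the `judge` function is only a structural placeholder standing in for the Judge LLM stage.

```python
import random

# Hypothetical prompt template; the paper's actual templates are not reproduced here.
TEMPLATE = (
    "Consider the {n}-qubit Hamiltonian H = {hamiltonian}. "
    "Write a first-order Trotter-Suzuki decomposition of exp(-iHt) for t = {t}."
)

PAULIS = "IXYZ"


def random_pauli_term(n_qubits, rng):
    """Return (coefficient, Pauli string) with at least one non-identity factor."""
    while True:
        label = "".join(rng.choice(PAULIS) for _ in range(n_qubits))
        if set(label) != {"I"}:
            break
    coeff = round(rng.uniform(-1.0, 1.0), 2)
    return coeff, label


def make_problem(n_qubits=2, n_terms=3, t=0.5, seed=0):
    """Stage 1 sketch: fill the template with a synthetic Hamiltonian."""
    rng = random.Random(seed)
    terms = [random_pauli_term(n_qubits, rng) for _ in range(n_terms)]
    hamiltonian = " + ".join(f"({c:+.2f})*{p}" for c, p in terms)
    return {
        "instruction": TEMPLATE.format(n=n_qubits, hamiltonian=hamiltonian, t=t),
        "terms": terms,
        "solution": None,  # later stages (solution, reasoning, judging) would fill this in
    }


def judge(record):
    """Placeholder for the judging stage: checks only that all Pauli strings have equal length."""
    width = len(record["terms"][0][1])
    return all(len(p) == width for _, p in record["terms"])


record = make_problem(seed=42)
print(record["instruction"])
print("passes structural check:", judge(record))
```

Seeding the generator makes each record reproducible, which is one simple way to keep a synthetic dataset auditable; whether the authors do this is not stated in the abstract.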

Related papers

Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
Advances in Machine Learning: Where Can Quantum Techniques Help? [0.0]
Quantum Machine Learning (QML) represents a promising frontier at the intersection of quantum computing and artificial intelligence. This review explores the potential of QML to address the computational bottlenecks of classical machine learning.
arXiv Detail & Related papers (2025-07-11T07:47:47Z)
Outlier-Robust Multi-Model Fitting on Quantum Annealers [29.24367815462826]
Multi-model fitting (MMF) presents a significant challenge in Computer Vision. Existing quantum-based approaches for model fitting are either limited to a single model or consider multi-model scenarios within outlier-free datasets. This paper introduces a novel approach, the robust quantum multi-model fitting (R-QuMF) algorithm to handle outliers effectively.
arXiv Detail & Related papers (2025-04-18T17:59:53Z)
Fine-Tuning Large Language Models on Quantum Optimization Problems for Circuit Generation [4.447306899057931]
Large language models (LLMs) have achieved remarkable outcomes in addressing complex problems. This paper shows how to leverage LLMs to automatically generate quantum circuits at a large scale. We have prepared 14,000 quantum circuits covering a substantial part of the quantum optimization landscape.
arXiv Detail & Related papers (2025-04-15T11:56:54Z)
An Efficient Quantum Classifier Based on Hamiltonian Representations [50.467930253994155]
Quantum machine learning (QML) is a discipline that seeks to transfer the advantages of quantum computing to data-driven tasks. We propose an efficient approach that circumvents the costs associated with data encoding by mapping inputs to a finite set of Pauli strings. We evaluate our approach on text and image classification tasks, against well-established classical and quantum models.
arXiv Detail & Related papers (2025-04-13T11:49:53Z)
Quantum Bayesian Networks for Machine Learning in Oil-Spill Detection [3.9554540293311864]
This paper introduces a novel Bayesian approach using Quantum Bayesian Networks (QBNs) to classify imbalanced datasets. We effectively address the challenge of integrating quantum enhancements with classical machine learning architectures. Our study demonstrates significant advances in detecting and classifying anomalies, contributing to more effective and precise environmental monitoring and management.
arXiv Detail & Related papers (2024-12-24T15:44:26Z)
QCircuitNet: A Large-Scale Hierarchical Dataset for Quantum Algorithm Design [17.747641494506087]
We introduce QCircuitNet, the first benchmark and test dataset designed to evaluate AI's capability in designing and implementing quantum algorithms. Unlike using AI for writing traditional codes, this task is fundamentally different and significantly more complicated due to highly flexible design space and intricate manipulation of qubits.
arXiv Detail & Related papers (2024-10-10T14:24:30Z)
Generalization Error Bound for Quantum Machine Learning in NISQ Era -- A Survey [37.69303106863453]
We conduct a Systematic Mapping Study (SMS) to explore the state-of-the-art generalization bound for supervised Quantum Machine Learning (QML) in the Noisy Intermediate-Scale Quantum (NISQ) era. Our study systematically summarizes the existing computational platforms with quantum hardware, datasets, optimization techniques, and the common properties of the bounds found in the literature. The SMS also highlights the limitations and challenges in QML in the NISQ era and discusses future research directions to advance the field.
arXiv Detail & Related papers (2024-09-11T21:17:30Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
Efficient Learning for Linear Properties of Bounded-Gate Quantum Circuits [63.733312560668274]
Given a quantum circuit containing d tunable RZ gates and G-d Clifford gates, can a learner perform purely classical inference to efficiently predict its linear properties? We prove that a sample complexity scaling linearly in d is necessary and sufficient to achieve a small prediction error, while the corresponding computational complexity may scale exponentially in d. We devise a kernel-based learning model capable of trading off prediction error and computational complexity, transitioning from exponential to polynomial scaling in many practical settings.
arXiv Detail & Related papers (2024-08-22T08:21:28Z)
Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models [1.8213213818713139]
We introduce and use the Qiskit HumanEval dataset to benchmark the ability of Large Language Models to produce quantum code. This dataset consists of more than 100 quantum computing tasks, each accompanied by a prompt, a canonical solution, and a difficulty scale to evaluate the correctness of the generated solutions.
arXiv Detail & Related papers (2024-06-20T20:14:22Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation. We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets. We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)
Quantum algorithms: A survey of applications and end-to-end complexities [90.05272647148196]
The anticipated applications of quantum computers span across science and industry. We present a survey of several potential application areas of quantum algorithms. We outline the challenges and opportunities in each area in an "end-to-end" fashion.
arXiv Detail & Related papers (2023-10-04T17:53:55Z)
QKSAN: A Quantum Kernel Self-Attention Network [53.96779043113156]
A Quantum Kernel Self-Attention Mechanism (QKSAM) is introduced to combine the data representation merit of Quantum Kernel Methods (QKM) with the efficient information extraction capability of SAM. A Quantum Kernel Self-Attention Network (QKSAN) framework is proposed based on QKSAM, which ingeniously incorporates the Deferred Measurement Principle (DMP) and conditional measurement techniques. Four QKSAN sub-models are deployed on PennyLane and IBM Qiskit platforms to perform binary classification on MNIST and Fashion MNIST.
arXiv Detail & Related papers (2023-08-25T15:08:19Z)
QDataset: Quantum Datasets for Machine Learning [1.160208922584163]
The QDataSet is a quantum dataset designed specifically to facilitate the training and development of QML algorithms. The datasets are structured to provide a wealth of information to enable machine learning practitioners to use the QDataSet to solve problems in applied quantum computation. Accompanying the datasets on the associated GitHub repository are a set of workbooks demonstrating the use of the QDataSet in a range of optimisation contexts.
arXiv Detail & Related papers (2021-08-15T05:30:59Z)
Quantum Federated Learning with Quantum Data [87.49715898878858]
Quantum machine learning (QML) has emerged as a promising field that leans on the developments in quantum computing to explore large complex machine learning problems. This paper proposes the first fully quantum federated learning framework that can operate over quantum data and, thus, share the learning of quantum circuit parameters in a decentralized manner.
arXiv Detail & Related papers (2021-05-30T12:19:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.