D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
- URL: http://arxiv.org/abs/2508.01309v1
- Date: Sat, 02 Aug 2025 10:45:05 GMT
- Title: D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
- Authors: Weibo Zhou, Lingbo Li, Shangsong Liang
- Abstract summary: D-SCoRE is a training-free pipeline that produces high-quality QA datasets from arbitrary textual sources. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.
- Score: 12.271220269415878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scarcity and high cost of high-quality question-answering (QA) datasets hinder supervised fine-tuning (SFT) for domain-specific large language models (LLMs). To address this, we introduce D-SCoRE, a training-free pipeline that utilizes LLMs and prompt engineering to produce diverse, high-quality QA datasets from arbitrary textual sources. D-SCoRE integrates $\textbf{D}$ocument-centric processing, $\textbf{S}$egmentation, $\textbf{Co}$T $\textbf{R}$easoning, and structured $\textbf{E}$xport to generate QA-CoT datasets tailored for domain-aware SFT. Multi-dimensional control mechanisms, such as semantic role transformation, question type balancing, and counterfactual materials, enhance diversity and relevance, overcoming the limitations of existing QA generation methods. LLMs fine-tuned on D-SCoRE-generated QA datasets and on human-annotated QA datasets (SQuAD, Covid-QA) are evaluated on the SQuADShifts and Covid-QA test sets, with the D-SCoRE-trained models outperforming across most domains. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds using an 8B LLM on consumer-grade hardware. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.
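The abstract names the stages but not their mechanics. The sketch below shows one way a segment-prompt-parse-export loop of this shape could be wired up; the helper names, the prompt wording, and the `chat` callable are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a D-SCoRE-style segment -> prompt -> parse -> export
# loop; names and prompt are assumptions, not the paper's implementation.
import json
from typing import Callable

PROMPT = """Read the passage and write {n} multiple-choice QA pairs.
For each: a question, four options (one correct answer plus three
counterfactual distractors grounded in the passage), a chain-of-thought,
and the answer. Vary question types and semantic roles across pairs.
Reply as a JSON list of objects with keys: question, options, cot, answer.

Passage:
{passage}"""

def segment(document: str, max_words: int = 200) -> list[str]:
    """Split a document into passages of at most ~200 words."""
    words = document.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def generate_qa_cot(passage: str, chat: Callable[[str], str],
                    n: int = 6) -> list[dict]:
    """Ask the LLM for n QA-CoT pairs and parse its structured reply."""
    return json.loads(chat(PROMPT.format(n=n, passage=passage)))

def export_jsonl(document: str, chat: Callable[[str], str],
                 path: str) -> None:
    """Run segmentation and generation, writing one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for passage in segment(document):
            for pair in generate_qa_cot(passage, chat):
                pair["context"] = passage  # keep the source passage for SFT
                f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

In practice the export step would need validation and retries, since structured export hinges on the model emitting well-formed JSON.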
Related papers
- SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting [16.86139440201837]
We introduce SustainableQA, a novel dataset and a scalable pipeline for generating comprehensive QA datasets from corporate sustainability reports and annual reports. With over 195,000 diverse factoid and non-factoid QA pairs, SustainableQA is an effective resource for developing and benchmarking advanced knowledge assistants.
arXiv Detail & Related papers (2025-08-05T02:03:59Z)
- Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
- TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension [8.489816179329832]
We present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of large language models (LLMs) in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters.
arXiv Detail & Related papers (2024-11-29T06:48:13Z)
- SEC-QA: A Systematic Evaluation Corpus for Financial QA [12.279234447220155]
Existing datasets are often constrained by size, context, or relevance to practical applications.
We propose SEC-QA, a continuous dataset generation framework with two key features.
We introduce a QA system based on program-of-thought that improves the ability to run complex information retrieval and quantitative reasoning pipelines (a minimal sketch of the idea follows this entry).
arXiv Detail & Related papers (2024-06-20T15:12:41Z)
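For readers unfamiliar with program-of-thought, the sketch below shows the core move: the LLM writes a small program and the interpreter, not the model, does the arithmetic. The prompt, the `chat` callable, and the bare `exec` are illustrative assumptions, not SEC-QA's implementation.

```python
# Hypothetical program-of-thought QA sketch; prompt and callables are
# illustrative stand-ins, not the SEC-QA authors' implementation.
from typing import Callable

POT_PROMPT = """Answer the question about the filing excerpt by writing
Python. Put the final value in a variable named `answer`.

Excerpt:
{context}

Question: {question}

Python:"""

def program_of_thought(context: str, question: str,
                       chat: Callable[[str], str]) -> object:
    """Have the LLM emit a program, then execute it to get the answer."""
    code = chat(POT_PROMPT.format(context=context, question=question))
    scope: dict = {}
    exec(code, scope)  # a real system would sandbox this, not call bare exec
    return scope.get("answer")
```

The point is that a computation like (revenue_2023 - revenue_2022) / revenue_2022 is delegated to the interpreter instead of being generated token by token.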
- FinTextQA: A Dataset for Long-form Financial Question Answering [10.1084081290893]
FinTextQA is a novel dataset for long-form question answering (LFQA) in finance.
The most effective system configuration on our dataset used Ada2 as the embedder, Automated Merged Retrieval as the retriever, Bge-Reranker-Base as the reranker, and Baichuan2-7B as the generator (a skeleton of this retrieve-rerank-generate chain follows this entry).
arXiv Detail & Related papers (2024-05-16T10:53:31Z)
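A skeleton of the retrieve-rerank-generate chain implied by that configuration, assuming the component roles match; every callable here is a stand-in rather than FinTextQA's actual stack.

```python
# Hypothetical retrieve-rerank-generate skeleton mirroring the embedder/
# retriever/reranker/generator roles named above; all functions are
# illustrative stand-ins, not FinTextQA's configuration.
from typing import Callable

def answer_long_form(question: str,
                     embed: Callable[[str], list[float]],
                     retrieve: Callable[[list[float], int], list[str]],
                     rerank: Callable[[str, list[str]], list[str]],
                     generate: Callable[[str], str],
                     k: int = 20, keep: int = 5) -> str:
    """Embed the question, fetch candidates, rerank, then generate."""
    candidates = retrieve(embed(question), k)       # e.g. Ada2 + a vector store
    best = rerank(question, candidates)[:keep]      # e.g. a BGE-style reranker
    context = "\n\n".join(best)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```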
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation [67.27999343730224]
We introduce QASnowball, an iterative bootstrapping framework for QA data augmentation. QASnowball can iteratively generate large-scale, high-quality QA data from a seed set of supervised examples. We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the results show that the data generated by QASnowball can facilitate training QA models (a minimal generate-filter sketch follows this entry).
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
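A minimal generate-filter loop in the spirit of that description; the function names, the quality filter, and the fixed round count are assumptions, not QASnowball's actual procedure.

```python
# Hypothetical sketch of an iterative QA-data "snowball"; every function
# name and the filtering rule are assumptions, not the paper's method.
from typing import Callable

def snowball(seed: list[dict],
             unlabeled_texts: list[str],
             train: Callable[[list[dict]], Callable[[str], dict]],
             quality: Callable[[dict], float],
             rounds: int = 3,
             threshold: float = 0.8) -> list[dict]:
    """Grow a QA dataset from a seed set over several generate-filter rounds."""
    data = list(seed)
    for _ in range(rounds):
        generator = train(data)                    # refit on the current data
        fresh = [generator(text) for text in unlabeled_texts]
        data += [qa for qa in fresh if quality(qa) >= threshold]  # keep good pairs
    return data
```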
- Long-Tailed Question Answering in an Open World [46.67715607552547]
We define Open Long-Tailed QA (OLTQA) as learning from long-tailed distributed data.
We propose an OLTQA model that encourages knowledge sharing between head, tail and unseen tasks.
On a large-scale OLTQA dataset, our model consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2023-05-11T04:28:58Z)
- Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents.
This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models.
We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
- Parameter-Efficient Abstractive Question Answering over Tables or Text [60.86457030988444]
A long-term ambition of information seeking QA systems is to reason over multi-modal contexts and generate natural answers to user queries.
Memory-intensive pre-trained language models are adapted to downstream tasks such as QA by fine-tuning on QA data in a specific modality, like unstructured text or structured tables. To avoid training such memory-hungry models while keeping a uniform architecture across modalities, parameter-efficient adapters add and train small task-specific bottleneck layers between transformer layers (a minimal adapter module follows this entry).
arXiv Detail & Related papers (2022-04-07T10:56:29Z)
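A minimal bottleneck adapter of the kind described above, as a hedged PyTorch sketch; the hidden sizes and residual placement are common defaults, not necessarily the paper's.

```python
# Minimal bottleneck adapter in PyTorch, in the spirit of the layers
# described above; sizes and placement are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project, apply a nonlinearity, up-project, add a residual."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))

# Only the adapter weights train; the pretrained transformer stays frozen,
# so each modality (text, tables) costs roughly 2 * d_model * bottleneck
# parameters per layer instead of a full fine-tuned model.
```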
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts (a much-simplified skeleton follows this entry). Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of the data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
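As rough orientation only: the skeleton below is a plain conditional VAE with the reparameterization trick and a KL term, omitting the hierarchical structure and information-maximizing objective the paper adds; all sizes and module choices are assumptions.

```python
# Plain conditional VAE skeleton (PyTorch), far simpler than the paper's
# hierarchical, information-maximizing model; sizes are assumptions.
import torch
import torch.nn as nn

class CondVAE(nn.Module):
    def __init__(self, d_ctx: int = 768, d_z: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(d_ctx, d_z)        # posterior mean from context
        self.to_logvar = nn.Linear(d_ctx, d_z)    # posterior log-variance
        self.decode = nn.Linear(d_z + d_ctx, d_ctx)  # stand-in for a QA decoder

    def forward(self, ctx: torch.Tensor):
        mu, logvar = self.to_mu(ctx), self.to_logvar(ctx)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decode(torch.cat([z, ctx], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl  # train with reconstruction loss + KL
```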
- Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models [56.268862325167575]
This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs). We leverage PLMs to address the strong token-to-token independence assumption made by the common training objective, maximum likelihood estimation, for the CQR task. We evaluate fine-tuned PLMs on the recently introduced CANARD dataset as an in-domain task and validate the models on data from the TREC 2019 CAsT Track as an out-of-domain task (an inference-only sketch follows this entry).
arXiv Detail & Related papers (2020-04-04T11:07:54Z)
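An inference-only sketch of seq2seq question rewriting with the Hugging Face API; `t5-small` is an untuned stand-in (the paper fine-tunes PLMs on CANARD), and the history separator and prefix are assumptions.

```python
# Hypothetical CQR inference sketch with a seq2seq PLM; "t5-small" is an
# untuned stand-in, not a checkpoint fine-tuned on CANARD as in the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Dialogue history plus the under-specified follow-up question.
history = ["Who wrote Dune?", "Frank Herbert.", "When was it published?"]
source = " ||| ".join(history)  # the separator scheme is an assumption

ids = tok(f"rewrite question: {source}", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
# A model fine-tuned for CQR would emit e.g. "When was Dune published?"
```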