Quality > Quantity: Synthetic Corpora from Foundation Models for
Closed-Domain Extractive Question Answering
- URL: http://arxiv.org/abs/2310.16995v1
- Date: Wed, 25 Oct 2023 20:48:16 GMT
- Title: Quality > Quantity: Synthetic Corpora from Foundation Models for
Closed-Domain Extractive Question Answering
- Authors: Saptarshi Sengupta, Connor Heaton, Shreya Ghosh, Preslav Nakov,
Prasenjit Mitra
- Abstract summary: We study extractive question answering within closed domains and introduce the concept of targeted pre-training.
Our proposed framework uses Galactica to generate synthetic, "targeted" corpora that align with specific writing styles and topics.
- Score: 35.38140071573828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domain adaptation, the process of training a model in one domain and applying
it to another, has been extensively explored in machine learning. While
training a domain-specific foundation model (FM) from scratch is an option,
recent methods have focused on adapting pre-trained FMs for domain-specific
tasks. However, our experiments reveal that neither approach consistently
achieves state-of-the-art (SOTA) results in the target domain. In
this work, we study extractive question answering within closed domains and
introduce the concept of targeted pre-training. This involves determining and
generating relevant data to further pre-train our models, as opposed to the
conventional philosophy of utilizing domain-specific FMs trained on a wide
range of data. Our proposed framework uses Galactica to generate synthetic,
"targeted" corpora that align with specific writing styles and topics, such
as research papers and radiology reports. This process can be viewed as a form
of knowledge distillation. We apply our method to two biomedical extractive
question answering datasets, COVID-QA and RadQA, achieving a new benchmark on
the former and demonstrating overall improvements on the latter. Code available
at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.
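The targeted pre-training pipeline described above can be sketched in two steps: prompt a generator foundation model (the paper uses Galactica) for passages matching a target topic and writing style, then use the resulting synthetic corpus for further pre-training. The function names and prompt template below are illustrative assumptions, not the paper's actual API; a stand-in lambda replaces the real language-model call.

```python
# Hedged sketch of "targeted" corpus generation. In the paper, the
# generator is Galactica and the synthetic corpus is used to further
# pre-train an extractive QA model; here the generator is a stub.

def build_targeted_prompts(topics, style):
    """Build one generation prompt per topic, tied to a writing style."""
    return [f"Write a {style} about {t}." for t in topics]

def generate_corpus(prompts, generate_fn):
    """Run a text generator (e.g. an LM pipeline) over each prompt."""
    return [generate_fn(p) for p in prompts]

# Usage with a stand-in generator; a real setup would wrap an LM call.
prompts = build_targeted_prompts(
    ["COVID-19 transmission", "chest CT findings"],
    "radiology report excerpt",
)
corpus = generate_corpus(prompts, lambda p: f"[synthetic text for: {p}]")
```

The synthetic passages produced this way would then serve as the "targeted" pre-training data, which the paper frames as a form of knowledge distillation from the generator model.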
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- Ontology Matching with Large Language Models and Prioritized Depth-First Search [0.2454454561635539]
We introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy.
This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases.
Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%.
arXiv Detail & Related papers (2025-01-20T12:29:09Z)
- Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.
We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)
- Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations [4.207253227315905]
We present SELF-TAUGHT, a problem-solving framework, which facilitates customized demonstrations.
In 15 tasks of multiple-choice questions, SELF-TAUGHT achieves superior performance to strong baselines.
We conduct comprehensive analyses on SELF-TAUGHT, including its generalizability to existing prompting methods.
arXiv Detail & Related papers (2024-08-22T11:41:35Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Developing Healthcare Language Model Embedding Spaces [0.20971479389679337]
Pre-trained Large Language Models (LLMs) often struggle on out-of-domain datasets like healthcare focused text.
Three methods are assessed: traditional masked language modeling, Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR) and a novel pre-training objective utilizing metadata categories from the healthcare settings.
Contrastively trained models outperform other approaches on the classification tasks, delivering strong performance from limited labeled data and with fewer model parameter updates required.
arXiv Detail & Related papers (2024-03-28T19:31:32Z)
- BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models [56.89958793648104]
Large Language Models (LLMs) are versatile and capable of addressing a diverse range of tasks.
Previous approaches either conduct continuous pre-training with domain-specific data or employ retrieval augmentation to support general LLMs.
We present a novel framework named BLADE, which enhances Black-box LArge language models with small Domain-spEcific models.
arXiv Detail & Related papers (2024-03-27T08:57:21Z)
- AcroFOD: An Adaptive Method for Cross-domain Few-shot Object Detection [59.10314662986463]
Cross-domain few-shot object detection aims to adapt object detectors in the target domain with a few annotated target data.
The proposed method achieves state-of-the-art performance on multiple benchmarks.
arXiv Detail & Related papers (2022-09-22T10:23:40Z)
- Incremental Learning Meets Transfer Learning: Application to Multi-site Prostate MRI Segmentation [16.50535949349874]
We propose a novel multi-site segmentation framework called incremental-transfer learning (ITL)
ITL learns a model from multi-site datasets in an end-to-end sequential fashion.
We show for the first time that leveraging our ITL training scheme is able to alleviate challenging catastrophic forgetting problems in incremental learning.
arXiv Detail & Related papers (2022-06-03T02:32:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.