Quality > Quantity: Synthetic Corpora from Foundation Models for
Closed-Domain Extractive Question Answering
- URL: http://arxiv.org/abs/2310.16995v1
- Date: Wed, 25 Oct 2023 20:48:16 GMT
- Title: Quality > Quantity: Synthetic Corpora from Foundation Models for
Closed-Domain Extractive Question Answering
- Authors: Saptarshi Sengupta, Connor Heaton, Shreya Ghosh, Preslav Nakov,
Prasenjit Mitra
- Abstract summary: We study extractive question answering within closed domains and introduce the concept of targeted pre-training.
Our proposed framework uses Galactica to generate synthetic, ``targeted'' corpora that align with specific writing styles and topics.
- Score: 35.38140071573828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domain adaptation, the process of training a model in one domain and applying
it to another, has been extensively explored in machine learning. While
training a domain-specific foundation model (FM) from scratch is an option,
recent methods have focused on adapting pre-trained FMs for domain-specific
tasks. However, our experiments reveal that neither approach consistently
achieves state-of-the-art (SOTA) results in the target domain. In
this work, we study extractive question answering within closed domains and
introduce the concept of targeted pre-training. This involves determining and
generating relevant data to further pre-train our models, as opposed to the
conventional philosophy of utilizing domain-specific FMs trained on a wide
range of data. Our proposed framework uses Galactica to generate synthetic,
``targeted'' corpora that align with specific writing styles and topics, such
as research papers and radiology reports. This process can be viewed as a form
of knowledge distillation. We apply our method to two biomedical extractive
question answering datasets, COVID-QA and RadQA, achieving a new benchmark on
the former and demonstrating overall improvements on the latter. Code available
at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.
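To make the recipe above concrete, the following is a minimal sketch of targeted pre-training using off-the-shelf Hugging Face components: a Galactica checkpoint generates a synthetic, targeted corpus, and a BERT-style backbone is further pre-trained on it with masked language modeling before extractive-QA fine-tuning. The checkpoint names, prompt, and hyperparameters are illustrative assumptions, not the paper's exact configuration; the paper's actual pipeline lives in the linked repository.

# Sketch: (1) generate a synthetic, targeted corpus with Galactica,
# (2) further pre-train an extractive-QA backbone on it with masked LM.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1) Synthetic corpus generation (the prompt below is a hypothetical example).
gen_tok = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
gen_model = AutoModelForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompts = ["Title: Chest CT findings in patients with COVID-19 pneumonia\n\n"]
synthetic_docs = []
for prompt in prompts:
    inputs = gen_tok(prompt, return_tensors="pt")
    output = gen_model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
    synthetic_docs.append(gen_tok.decode(output[0], skip_special_tokens=True))

# 2) Targeted (continued) pre-training of a BERT-style backbone on the synthetic corpus.
qa_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
qa_backbone = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

corpus = Dataset.from_dict({"text": synthetic_docs})
tokenized = corpus.map(
    lambda batch: qa_tok(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=qa_backbone,
    args=TrainingArguments(output_dir="targeted-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(qa_tok, mlm_probability=0.15),
)
trainer.train()

# The further pre-trained checkpoint would then be fine-tuned on COVID-QA or
# RadQA with a standard extractive-QA head (AutoModelForQuestionAnswering).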
Related papers
- New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression Comprehension (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.
We introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships.
Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model's ability to reject non-existent targets.
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - On Domain-Specific Post-Training for Multimodal Large Language Models [72.67107077850939]
This paper systematically investigates domain adaptation of MLLMs through post-training.
We focus on data synthesis, training pipelines, and task evaluation.
We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z) - Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.
We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z) - Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations [4.207253227315905]
We present SELF-TAUGHT, a problem-solving framework, which facilitates customized demonstrations.
On 15 multiple-choice question tasks, SELF-TAUGHT achieves superior performance over strong baselines.
We conduct comprehensive analyses on SELF-TAUGHT, including its generalizability to existing prompting methods.
arXiv Detail & Related papers (2024-08-22T11:41:35Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned from Zephyr-7B-SFT and Llama-3-8B-Instruct, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Adapting to Distribution Shift by Visual Domain Prompt Generation [34.19066857066073]
We adapt a model at test time using a small amount of unlabeled data to address distribution shifts.
We build a knowledge bank to learn the transferable knowledge from source domains.
The proposed method outperforms previous work on 5 large-scale benchmarks including WILDS and DomainNet.
arXiv Detail & Related papers (2024-05-05T02:44:04Z) - Developing Healthcare Language Model Embedding Spaces [0.20971479389679337]
Pre-trained Large Language Models (LLMs) often struggle on out-of-domain datasets like healthcare-focused text.
Three methods are assessed: traditional masked language modeling, Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR), and a novel pre-training objective utilizing metadata categories from the healthcare setting.
Contrastively trained models outperform other approaches on the classification tasks, delivering strong performance from limited labeled data and with fewer model parameter updates required.
arXiv Detail & Related papers (2024-03-28T19:31:32Z) - DG-TTA: Out-of-domain medical image segmentation through Domain Generalization and Test-Time Adaptation [43.842694540544194]
We propose to combine domain generalization and test-time adaptation to create a highly effective approach for reusing pre-trained models in unseen target domains.
We demonstrate that our method, combined with pre-trained whole-body CT models, can effectively segment MR images with high accuracy.
arXiv Detail & Related papers (2023-12-11T10:26:21Z) - AdAM: Few-Shot Image Generation via Adaptation-Aware Kernel Modulation [71.58154388819887]
Few-shot image generation (FSIG) aims to generate new and diverse images given only a few (e.g., 10) training samples.
Recent work has addressed FSIG by leveraging a GAN pre-trained on a large-scale source domain and adapting it to the target domain with few target samples.
We propose Adaptation-Aware kernel Modulation (AdAM) for general FSIG across different levels of source-target domain proximity.
arXiv Detail & Related papers (2023-07-04T03:56:43Z) - Improving Domain Generalization with Domain Relations [77.63345406973097]
This paper focuses on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on.
We propose a new approach called D^3G to learn domain-specific models.
Our results show that D^3G consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-02-06T08:11:16Z) - Few-shot Image Generation via Adaptation-Aware Kernel Modulation [33.191479192580275]
Few-shot image generation (FSIG) aims to generate new and diverse samples given an extremely limited number of samples from a domain.
Recent work has addressed the problem using a transfer learning approach, leveraging a GAN pretrained on a large-scale source domain dataset.
We propose Adaptation-Aware kernel Modulation (AdAM) to address general FSIG across different levels of source-target domain proximity.
arXiv Detail & Related papers (2022-10-29T10:26:40Z) - Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from
Mixture-of-Experts [33.21435044949033]
Most existing methods perform training on multiple source domains using a single model.
We propose a novel framework for unsupervised test-time adaptation, which is formulated as a knowledge distillation process.
arXiv Detail & Related papers (2022-10-08T02:28:10Z) - Prior Knowledge Guided Unsupervised Domain Adaptation [82.9977759320565]
We propose a Knowledge-guided Unsupervised Domain Adaptation (KUDA) setting where prior knowledge about the target class distribution is available.
In particular, we consider two specific types of prior knowledge about the class distribution in the target domain: Unary Bound and Binary Relationship.
We propose a rectification module that uses such prior knowledge to refine model generated pseudo labels.
arXiv Detail & Related papers (2022-07-18T18:41:36Z) - Incremental Learning Meets Transfer Learning: Application to Multi-site
Prostate MRI Segmentation [16.50535949349874]
We propose a novel multi-site segmentation framework called incremental-transfer learning (ITL).
ITL learns a model from multi-site datasets in an end-to-end sequential fashion.
We show for the first time that leveraging our ITL training scheme can alleviate the challenging catastrophic forgetting problem in incremental learning.
arXiv Detail & Related papers (2022-06-03T02:32:01Z) - Cross Domain Few-Shot Learning via Meta Adversarial Training [34.383449283927014]
Few-shot relation classification (RC) is one of the critical problems in machine learning.
We present a novel model that takes into consideration the afore-mentioned cross-domain situation.
A meta-based adversarial training framework is proposed to fine-tune the trained networks for adapting to data from the target domain.
arXiv Detail & Related papers (2022-02-11T15:52:29Z) - Unified Instance and Knowledge Alignment Pretraining for Aspect-based
Sentiment Analysis [96.53859361560505]
Aspect-based Sentiment Analysis (ABSA) aims to determine the sentiment polarity towards an aspect.
There always exists severe domain shift between the pretraining and downstream ABSA datasets.
We introduce a unified alignment pretraining framework into the vanilla pretrain-finetune pipeline.
arXiv Detail & Related papers (2021-10-26T04:03:45Z) - Meta-FDMixup: Cross-Domain Few-Shot Learning Guided by Labeled Target
Data [95.47859525676246]
A recent study finds that existing few-shot learning methods, trained on the source domain, fail to generalize to the novel target domain when a domain gap is observed.
In this paper, we realize that the labeled target data in Cross-Domain Few-Shot Learning has not been leveraged in any way to help the learning process.
arXiv Detail & Related papers (2021-07-26T06:15:45Z) - Source-Free Open Compound Domain Adaptation in Semantic Segmentation [99.82890571842603]
In SF-OCDA, only the source pre-trained model and the target data are available to learn the target model.
We propose the Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles at the feature level.
Our method produces state-of-the-art results on the C-Driving dataset.
arXiv Detail & Related papers (2021-06-07T08:38:41Z)