Domain Specific Fine-tuning of Denoising Sequence-to-Sequence Models for
Natural Language Summarization
- URL: http://arxiv.org/abs/2204.09716v1
- Date: Wed, 6 Apr 2022 18:17:14 GMT
- Title: Domain Specific Fine-tuning of Denoising Sequence-to-Sequence Models for
Natural Language Summarization
- Authors: Brydon Parker, Alik Sokolov, Mahtab Ahmed, Matt Kalebic, Sedef Akinli
Kocak, Ofer Shai
- Abstract summary: We explore applications of a state-of-the-art NLP model (BART) and strategies for tuning it to optimal performance.
We show that our end-to-end fine-tuning approach can result in a 5-6% absolute ROUGE-1 improvement over an out-of-the-box pre-trained BART summarizer.
- Score: 2.9360071145551068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Summarization of long-form text data is a problem especially pertinent in
knowledge economy jobs such as medicine and finance, which require continuously
staying informed on a sophisticated and evolving body of knowledge. As such,
isolating and summarizing key content automatically using Natural Language
Processing (NLP) techniques holds the potential for extensive time savings in
these industries. We explore applications of a state-of-the-art NLP model
(BART) and strategies for tuning it to optimal performance using data
augmentation and various fine-tuning strategies. We show that our end-to-end
fine-tuning approach can result in a 5-6% absolute ROUGE-1 improvement over an
out-of-the-box pre-trained BART summarizer when tested on domain-specific data,
and we make our end-to-end pipeline available to reproduce these results on
finance, medical, or other user-specified domains.
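A minimal sketch of the underlying recipe (not the authors' released
pipeline): fine-tuning a pre-trained BART summarizer on in-domain
(document, summary) pairs with Hugging Face Transformers. The data file,
column names, and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch (not the authors' released pipeline): fine-tune an
# out-of-the-box BART summarizer on domain-specific (document, summary)
# pairs. File name, column names, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/bart-large-cnn"  # pre-trained summarization checkpoint
tokenizer = BartTokenizerFast.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

# Hypothetical in-domain corpus with "document" and "summary" columns.
data = load_dataset("csv", data_files={"train": "finance_train.csv"})["train"]

def preprocess(batch):
    enc = tokenizer(batch["document"], max_length=1024, truncation=True)
    enc["labels"] = tokenizer(
        text_target=batch["summary"], max_length=128, truncation=True
    )["input_ids"]
    return enc

train_ds = data.map(preprocess, batched=True, remove_columns=data.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="bart-domain-summarizer",
        per_device_train_batch_size=2,
        learning_rate=3e-5,
        num_train_epochs=3,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In this sketch, pointing the loader at a medical corpus instead of
finance_train.csv is the only change needed to target another domain.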
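The reported gain is measured in absolute ROUGE-1 points. A toy comparison
using the `evaluate` library (an assumption about tooling, not the authors'
scorer) shows how such a difference is computed:

```python
# Toy ROUGE-1 comparison between a baseline and a fine-tuned summary.
# An "absolute" improvement is a simple difference of ROUGE-1 scores.
import evaluate

rouge = evaluate.load("rouge")
reference = ["quarterly revenue rose 12 percent on strong card spending"]
baseline = ["the company reported results for the quarter"]
finetuned = ["revenue rose 12 percent driven by card spending"]

r_base = rouge.compute(predictions=baseline, references=reference)["rouge1"]
r_ft = rouge.compute(predictions=finetuned, references=reference)["rouge1"]
print(f"absolute ROUGE-1 gain: {100 * (r_ft - r_base):.1f} points")
```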
Related papers
- IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining [2.009700777745832]
Pretrained Large Language Models (LLMs) have demonstrated strong capabilities in various fields of natural language generation.
When using generative AI to process downstream tasks, a common approach is to add new knowledge through continued training or fine-tuning.
In this article, we proposed the Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of downstream tasks and constructs a new subset using a heuristic function $\phi$ of each special token and its information gain.
We explored the many positive effects of this method's customized tokenizer on domain-adaptive pretraining and verified that it can perform better than the baseline (a toy scoring sketch follows the list below).
arXiv Detail & Related papers (2024-05-16T07:25:10Z)
- Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse [4.98050508891467]
We propose a two-stage approach for the construction of production prompts designed to yield high-quality data.
This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions.
We introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data.
arXiv Detail & Related papers (2024-03-14T08:27:32Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- State Sequences Prediction via Fourier Transform for Representation Learning [111.82376793413746]
We propose State Sequences Prediction via Fourier Transform (SPF), a novel method for learning expressive representations efficiently.
We theoretically analyze the existence of structural information in state sequences, which is closely related to policy performance and signal regularity.
Experiments demonstrate that the proposed method outperforms several state-of-the-art algorithms in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2023-10-24T14:47:02Z)
- Controlled Randomness Improves the Performance of Transformer Models [4.678970068275123]
We introduce controlled randomness, i.e. noise, into the training process to improve the fine-tuning of language models.
We find that adding such noise can improve performance on our two downstream tasks: joint named entity recognition and relation extraction, and text summarization (a noise-injection sketch follows the list below).
arXiv Detail & Related papers (2023-10-20T14:12:55Z)
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets [9.714447724811842]
This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models.
We capitalize on the interoperability of open-source models, ensuring a seamless and transparent integration.
The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression.
arXiv Detail & Related papers (2023-10-07T12:52:58Z)
- "FIJO": a French Insurance Soft Skill Detection Dataset [0.0]
This article proposes a new public dataset, FIJO, containing insurance job offers, including many soft skill annotations.
We present the results of skill detection algorithms using a named entity recognition approach and show that transformer-based models achieve good token-wise performance on this dataset (a token-classification sketch follows the list below).
arXiv Detail & Related papers (2022-04-11T15:54:22Z)
- Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that stabilization techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications (a layer-freezing sketch follows the list below).
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method using language models trained on linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings (a linearization sketch follows the list below).
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
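For the IGOT entry above: the snippet does not spell out the heuristic
$\phi$, so the following is only a toy illustration of ranking candidate
tokens by information gain over a domain/general document split:

```python
# Toy illustration (not the IGOT paper's exact phi heuristic): score each
# candidate token by the information gain of "document contains token"
# with respect to a domain-vs-general label.
import math
from collections import Counter

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def information_gain(token, domain_docs, general_docs):
    docs = [(d.split(), 1) for d in domain_docs] + \
           [(d.split(), 0) for d in general_docs]
    n = len(docs)
    base = entropy([len(domain_docs) / n, len(general_docs) / n])
    cond = 0.0
    for group in ([y for words, y in docs if token in words],
                  [y for words, y in docs if token not in words]):
        if group:
            p = sum(group) / len(group)
            cond += len(group) / n * entropy([p, 1 - p])
    return base - cond

domain = ["ebitda margin guidance raised", "ebitda beat consensus estimates"]
general = ["the cat sat on the mat", "rain is expected later today"]
vocab = Counter(w for d in domain + general for w in d.split())
print(sorted(vocab, key=lambda t: -information_gain(t, domain, general))[:3])
```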
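For the controlled-randomness entry: one common way to inject such noise is
to perturb input embeddings at training time. A minimal sketch (the paper's
exact placement and schedule may differ):

```python
# Minimal sketch: add zero-mean Gaussian noise to input embeddings during
# fine-tuning only; sigma controls the amount of "controlled randomness".
import torch

def add_training_noise(embeddings: torch.Tensor, sigma: float = 0.01):
    return embeddings + sigma * torch.randn_like(embeddings)

emb = torch.randn(2, 16, 768, requires_grad=True)  # (batch, seq_len, hidden)
loss = add_training_noise(emb).pow(2).mean()  # placeholder loss
loss.backward()  # gradients flow through the noisy embeddings
```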
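For the FIJO entry: soft-skill detection via named entity recognition is a
token classification task. A sketch with a French encoder (the tag set and
model choice are assumptions, not the dataset's actual schema):

```python
# Sketch of NER-style soft-skill detection; fine-tune on FIJO before use,
# since the classification head below starts untrained.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-SKILL", "I-SKILL"]  # hypothetical soft-skill tag set
tok = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "camembert-base", num_labels=len(labels))

enc = tok("Capacité à travailler en équipe", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits            # (1, seq_len, num_labels)
print([labels[i] for i in logits.argmax(-1)[0].tolist()])  # per-token tags
```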
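For the biomedical fine-tuning-stability entry: one stabilization trick often
studied in low-resource settings is freezing the lower encoder layers so
fewer weights move on tiny datasets. A sketch (the paper's exact recipe
differs):

```python
# Freeze the embedding layer and the lower half of the encoder before
# fine-tuning; only the upper layers are updated on the small dataset.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

for module in [model.embeddings, *model.encoder.layer[:6]]:
    for p in module.parameters():
        p.requires_grad = False  # keep lower layers fixed

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```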
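For the DAGA entry: the method trains a language model on labeled sentences
linearized so that tags become ordinary tokens; sampling from that model then
yields synthetic labeled data. A sketch of the linearization step (the paper
inserts each tag before the word it labels and drops "O" tags):

```python
def linearize(tokens, tags):
    """Fold token-level tags into the token stream; 'O' tags are dropped."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)  # tag token precedes the word it labels
        out.append(tok)
    return " ".join(out)

print(linearize(["John", "lives", "in", "Paris"],
                ["B-PER", "O", "O", "B-LOC"]))
# -> B-PER John lives in B-LOC Paris
```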
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.