How much pretraining data do language models need to learn syntax?
- URL: http://arxiv.org/abs/2109.03160v2
- Date: Thu, 9 Sep 2021 15:59:00 GMT
- Title: How much pretraining data do language models need to learn syntax?
- Authors: Laura Pérez-Mayos, Miguel Ballesteros, Leo Wanner
- Abstract summary: Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the syntactic knowledge of the models, using RoBERTa.
- Score: 12.668478784932878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers-based pretrained language models achieve outstanding results in
many well-known NLU benchmarks. However, while pretraining methods are very
convenient, they are expensive in terms of time and resources. This calls for a
study of the impact of pretraining data size on the knowledge of the models. We
explore this impact on the syntactic capabilities of RoBERTa, using models
trained on incremental sizes of raw text data. First, we use syntactic
structural probes to determine whether models pretrained on more data encode a
higher amount of syntactic information. Second, we perform a targeted syntactic
evaluation to analyze the impact of pretraining data size on the syntactic
generalization performance of the models. Third, we compare the performance of
the different models on three downstream applications: part-of-speech tagging,
dependency parsing and paraphrase identification. We complement our study with
an analysis of the cost-benefit trade-off of training such models. Our
experiments show that while models pretrained on more data encode more
syntactic knowledge and perform better on downstream applications, they do not
always achieve better performance across the different syntactic phenomena, and
they come at a higher financial and environmental cost.
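As an illustration of the kind of targeted syntactic evaluation described above, the sketch below scores both sentences of a minimal pair with a masked-language-model pseudo-log-likelihood and checks that the grammatical variant is preferred. This is a minimal sketch, not the authors' evaluation code: it assumes the Hugging Face transformers library and the public roberta-base checkpoint, and the subject-verb agreement pair is a made-up placeholder.

```python
# Minimal sketch of a targeted syntactic evaluation with a masked LM
# (illustrative only, not the paper's code): score each sentence of a
# minimal pair with a pseudo-log-likelihood and check that the
# grammatical sentence scores higher.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Hypothetical subject-verb agreement minimal pair (placeholder example).
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))
```

Repeating this comparison over many minimal pairs per phenomenon yields a per-phenomenon accuracy, which is how targeted syntactic evaluations are typically summarized.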
Related papers
- Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training [12.29061850090405]
We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline.
Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup.
arXiv Detail & Related papers (2024-10-28T07:30:05Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization [0.0]
We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference corpus.
To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks.
Our combination method enhances our model's resistance to perturbation testing, enabling it to consistently outperform the pre-trained baseline.
arXiv Detail & Related papers (2022-12-16T23:37:44Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Tracing Origins: Coref-aware Machine Reading Comprehension [43.352833140317486]
We imitate the human reading process of connecting anaphoric expressions and leverage the coreference information to enhance the word embeddings from the pre-trained model.
We demonstrate that explicitly incorporating coreference information in the fine-tuning stage performs better than incorporating it when training a pre-trained language model.
arXiv Detail & Related papers (2021-10-15T09:28:35Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Training Data Leakage Analysis in Language Models [6.843491191969066]
We introduce a methodology for identifying the user content in the training data that could be leaked under a strong and realistic threat model.
We propose two metrics to quantify user-level data leakage by measuring a model's ability to produce unique sentence fragments within training data.
arXiv Detail & Related papers (2021-01-14T00:57:32Z)
- Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa.
Existing methods utilize the syntax of text either in the pre-training stage or in the fine-tuning stage, and thus suffer from a discrepancy between the two stages.
We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
arXiv Detail & Related papers (2020-12-28T06:48:04Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)