On the effect of curriculum learning with developmental data for grammar acquisition
- URL: http://arxiv.org/abs/2311.00128v2
- Date: Fri, 3 Nov 2023 16:42:33 GMT
- Title: On the effect of curriculum learning with developmental data for grammar acquisition
- Authors: Mattia Opper, J. Morrison, N. Siddharth
- Abstract summary: This work explores the degree to which grammar acquisition is driven by language `simplicity' and the source modality (speech vs. text) of data.
We find that grammar acquisition is largely driven by exposure to speech data, and in particular through exposure to two of the BabyLM training corpora: AO-Childes and Open Subtitles.
- Score: 4.4044968357361745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work explores the degree to which grammar acquisition is driven by
language `simplicity' and the source modality (speech vs. text) of data. Using
BabyBERTa as a probe, we find that grammar acquisition is largely driven by
exposure to speech data, and in particular through exposure to two of the
BabyLM training corpora: AO-Childes and Open Subtitles. We arrive at this
finding by examining various ways of presenting input data to our model. First,
we assess the impact of various sequence-level complexity based curricula. We
then examine the impact of learning over `blocks' -- covering spans of text
that are balanced for the number of tokens in each of the source corpora
(rather than number of lines). Finally, we explore curricula that vary the
degree to which the model is exposed to different corpora. In all cases, we
find that over-exposure to AO-Childes and Open Subtitles significantly drives
performance. We verify these findings through a comparable control dataset in
which exposure to these corpora, and speech more generally, is limited by
design. Our findings indicate that it is not the proportion of tokens occupied
by high-utility data that aids acquisition, but rather the proportion of
training steps assigned to such data. We hope this encourages future research
into the use of more developmentally plausible linguistic data (which tends to
be more scarce) to augment general purpose pre-training regimes.
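To make the manipulations described in the abstract concrete, the sketch below illustrates a sequence-level complexity curriculum and token-balanced `blocks'. It is purely illustrative and not the authors' released code; the corpus names, the token-count complexity proxy, and the block size are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Illustrative sketch only: corpus names, the complexity proxy, and the block
# size are assumptions for the example, not the authors' released code.

def complexity(line: str) -> int:
    """A simple sequence-level complexity proxy: whitespace token count."""
    return len(line.split())

def complexity_curriculum(lines: List[str]) -> List[str]:
    """Order training sequences from 'simplest' to most complex."""
    return sorted(lines, key=complexity)

def token_balanced_blocks(corpora: Dict[str, List[str]],
                          tokens_per_block: int = 100_000) -> List[Tuple[str, List[str]]]:
    """Split each corpus into spans holding roughly the same number of tokens,
    so corpora contribute blocks balanced by tokens rather than by lines."""
    blocks: List[Tuple[str, List[str]]] = []
    for name, lines in corpora.items():
        current: List[str] = []
        count = 0
        for line in lines:
            current.append(line)
            count += len(line.split())
            if count >= tokens_per_block:
                blocks.append((name, current))
                current, count = [], 0
        if current:
            blocks.append((name, current))
    return blocks

# Hypothetical usage with two of the BabyLM corpora named in the abstract:
# corpora = {"aochildes": read_lines("aochildes.txt"),          # read_lines is a
#            "open_subtitles": read_lines("open_subtitles.txt")}  # hypothetical helper
# ordered = complexity_curriculum(corpora["aochildes"])
# blocks = token_balanced_blocks(corpora)
```

Varying how often blocks drawn from each corpus are visited during training would then implement the exposure curricula the abstract describes.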
Related papers
- Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech.
We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets.
These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
- Reconsidering Sentence-Level Sign Language Translation [2.099922236065961]
We show that for 33% of sentences in our sample, our fluent Deaf signer annotators were only able to understand key parts of the clip in light of discourse-level context.
These results underscore the importance of understanding and sanity checking examples when adapting machine learning to new domains.
arXiv Detail & Related papers (2024-06-16T19:19:54Z)
- Enhancing Argument Structure Extraction with Efficient Leverage of Contextual Information [79.06082391992545]
We propose an Efficient Context-aware model (ECASE) that fully exploits contextual information.
We introduce a sequence-attention module and distance-weighted similarity loss to aggregate contextual information and argumentative information.
Our experiments on five datasets from various domains demonstrate that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-10-08T08:47:10Z)
- Studying the impacts of pre-training using ChatGPT-generated text on downstream tasks [0.0]
Our research aims to investigate the influence of artificial text in the pre-training phase of language models.
We conducted a comparative analysis between a RoBERTa model pre-trained on CNN/DailyMail news articles and one pre-trained on ChatGPT-generated text derived from the same articles.
We demonstrate that the utilization of artificial text during pre-training does not have a significant impact on either the performance of the models in downstream tasks or their gender bias.
arXiv Detail & Related papers (2023-09-02T12:56:15Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows superior performance on non-matched image-text pair detection when training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
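Text-image contrastive models of the kind referenced in the entry above typically optimize an InfoNCE-style objective over matched and non-matched image-text pairs. The sketch below is a generic version of that objective, not the paper's exact formulation; tensor names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE-style loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) embeddings; row i of each is a matched pair.
    Non-matched pairs within the batch act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```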
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Dependency Induction Through the Lens of Visual Perception [81.91502968815746]
We propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based heuristic to jointly learn constituency-structure and dependency-structure grammars.
Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
arXiv Detail & Related papers (2021-09-20T18:40:37Z)
- Subsentence Extraction from Text Using Coverage-Based Deep Learning Language Models [3.3461339691835277]
We propose a coverage-based sentiment and subsentence extraction system.
The predicted subsentence consists of auxiliary information expressing a sentiment.
Our approach outperforms the state-of-the-art approaches by a large margin in subsentence prediction.
arXiv Detail & Related papers (2021-04-20T06:24:49Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
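The distillation step described in the last entry above amounts to matching the student's masked-word distribution to the syntactic LM's approximate marginal distribution over words in context. A minimal sketch of such a loss follows, assuming pre-computed teacher probabilities; the shapes and names are illustrative and not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def syntactic_distillation_loss(student_logits: torch.Tensor,
                                teacher_probs: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) averaged over masked positions.

    student_logits: (num_masked, vocab_size) raw scores from the BERT-style student
    teacher_probs:  (num_masked, vocab_size) approximate marginal word probabilities
                    from the syntactic LM teacher
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```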
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.