Automatic Annotation of Grammaticality in Child-Caregiver Conversations
- URL: http://arxiv.org/abs/2403.14208v1
- Date: Thu, 21 Mar 2024 08:00:05 GMT
- Title: Automatic Annotation of Grammaticality in Child-Caregiver Conversations
- Authors: Mitja Nikolaus, Abhishek Agrawal, Petros Kaklamanis, Alex Warstadt, Abdellah Fourtassi
- Abstract summary: This work contributes to the growing literature on applying state-of-the-art NLP methods to help study child language acquisition at scale.
We propose a coding scheme for context-dependent grammaticality and annotate more than 4,000 utterances from a large corpus of transcribed conversations.
Our results show that fine-tuned Transformer-based models perform best, achieving human inter-annotation agreement levels.
- Score: 7.493963534076502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The acquisition of grammar has been a central question to adjudicate between theories of language acquisition. In order to conduct faster, more reproducible, and larger-scale corpus studies on grammaticality in child-caregiver conversations, tools for automatic annotation can offer an effective alternative to tedious manual annotation. We propose a coding scheme for context-dependent grammaticality in child-caregiver conversations and annotate more than 4,000 utterances from a large corpus of transcribed conversations. Based on these annotations, we train and evaluate a range of NLP models. Our results show that fine-tuned Transformer-based models perform best, achieving human inter-annotation agreement levels. As a first application and sanity check of this tool, we use the trained models to annotate a corpus almost two orders of magnitude larger than the manually annotated data and verify that children's grammaticality shows a steady increase with age. This work contributes to the growing literature on applying state-of-the-art NLP methods to help study child language acquisition at scale.
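The claim that models reach "human inter-annotation agreement levels" can be illustrated by comparing model-vs-human agreement with human-vs-human agreement. The sketch below is only an illustration: it assumes Cohen's kappa as the agreement measure (the abstract does not name the metric), and the label names and toy annotations are hypothetical, not taken from the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations: "g" = grammatical, "u" = ungrammatical.
annotator_1 = ["g", "g", "u", "g", "u", "g", "g", "u"]
annotator_2 = ["g", "g", "u", "g", "g", "g", "g", "u"]
model_preds = ["g", "g", "u", "g", "u", "g", "u", "u"]

human_human = cohen_kappa(annotator_1, annotator_2)
human_model = cohen_kappa(annotator_1, model_preds)
print(f"human vs. human kappa: {human_human:.3f}")
print(f"human vs. model kappa: {human_model:.3f}")
```

Under this reading, a model is at human level when its agreement with an annotator is about as high as the agreement between two annotators.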
Related papers
- Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'.
We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data.
We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- Exploring Automated Keyword Mnemonics Generation with Large Language Models via Overgenerate-and-Rank [4.383205675898942]
Keyword mnemonics are a technique for memorizing vocabulary through memorable associations with a target word via a verbal cue.
We propose a novel overgenerate-and-rank method via prompting large language models to generate verbal cues.
Results show that LLM-generated mnemonics are comparable to human-generated ones in terms of imageability, coherence, and perceived usefulness.
arXiv Detail & Related papers (2024-09-21T00:00:18Z)
- Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech.
We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets.
These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken Conversations [22.894541507068933]
This paper presents our approach to build generalized models for the Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations Challenge of DSTC-10.
We employ extensive data augmentation strategies on written data, including artificial error injection and round-trip text-speech transformation.
Our approach ranks third on the objective evaluation and second on the final official human evaluation.
arXiv Detail & Related papers (2022-03-08T12:26:57Z)
- Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa.
Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning stage, so that they suffer from discrepancy between the two stages.
We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
arXiv Detail & Related papers (2020-12-28T06:48:04Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.