Acquiring Linguistic Knowledge from Multimodal Input
- URL: http://arxiv.org/abs/2402.17936v1
- Date: Tue, 27 Feb 2024 23:29:10 GMT
- Title: Acquiring Linguistic Knowledge from Multimodal Input
- Authors: Theodor Amariucai, Alex Warstadt
- Abstract summary: In contrast to children, language models (LMs) exhibit considerably inferior data efficiency when acquiring language.
We test the hypothesis that this data efficiency gap is partly caused by a lack of multimodal input and grounding in the learning environment of typical language models.
- Score: 10.965306219502303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In contrast to children, language models (LMs) exhibit considerably inferior
data efficiency when acquiring language. In this submission to the BabyLM
Challenge (Warstadt et al., 2023), we test the hypothesis that this data
efficiency gap is partly caused by a lack of multimodal input and grounding in
the learning environment of typical language models. Although previous work
looking into this question found that multimodal training can even harm
language-only performance, we speculate that these findings can be attributed
to catastrophic forgetting of complex language due to fine-tuning on captions
data. To test our hypothesis, we perform an ablation study on FLAVA (Singh et
al., 2022), a multimodal vision-and-language model, independently varying the
volume of text and vision input to quantify how much text data (if any) can be
offset by vision at different data scales. We aim to limit catastrophic
forgetting through a multitask pretraining regime that includes unimodal
text-only tasks and data sampled from WiT, the relatively diverse
Wikipedia-based dataset (Srinivasan et al., 2021). Our results are largely
negative: Multimodal pretraining does not harm our models' language performance
but does not consistently help either. That said, our conclusions are limited
by our having been able to conduct only a small number of runs. While we must
leave open the possibility that multimodal input explains some of the gap in
data efficiency between LMs and humans, positive evidence for this hypothesis
will require better architectures and techniques for multimodal training.
Related papers
- Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech.
We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets.
These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z) - TextMI: Textualize Multimodal Information for Integrating Non-verbal
Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied to a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Cross-lingual Transfer Learning for Check-worthy Claim Identification over Twitter [7.601937548486356]
Misinformation spread over social media has become an undeniable infodemic.
We present a systematic study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of the Multilingual BERT (mBERT) model.
Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models that are trained on the target language.
arXiv Detail & Related papers (2022-11-09T18:18:53Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing such translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks [3.42658286826597]
We analyze different prompt-based fine-tuning techniques to improve results on both language and multimodal causal transformer models.
Our results show that with simple model-agnostic prompt-based fine-tuning, comparable results can be reached using only 35%-40% of the fine-tuning training dataset.
arXiv Detail & Related papers (2022-04-25T18:56:55Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo parallel data with translated source sentences, but translates natural source sentences at inference.
The source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo parallel data {natural source, translated target} to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.