Syntactic Inductive Bias in Transformer Language Models: Especially
Helpful for Low-Resource Languages?
- URL: http://arxiv.org/abs/2311.00268v1
- Date: Wed, 1 Nov 2023 03:32:46 GMT
- Title: Syntactic Inductive Bias in Transformer Language Models: Especially
Helpful for Low-Resource Languages?
- Authors: Luke Gessler, Nathan Schneider
- Abstract summary: A line of work on Transformer-based language models has attempted to use syntactic inductive bias to enhance the pretraining process.
We investigate whether these methods can compensate for data sparseness in low-resource languages.
We find that these syntactic inductive bias methods produce uneven results in low-resource settings.
- Score: 10.324936426012417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A line of work on Transformer-based language models such as BERT has
attempted to use syntactic inductive bias to enhance the pretraining process,
on the theory that building syntactic structure into the training process
should reduce the amount of data needed for training. But such methods are
often tested for high-resource languages such as English. In this work, we
investigate whether these methods can compensate for data sparseness in
low-resource languages, hypothesizing that they ought to be more effective for
low-resource languages. We experiment with five low-resource languages: Uyghur,
Wolof, Maltese, Coptic, and Ancient Greek. We find that these syntactic
inductive bias methods produce uneven results in low-resource settings, and
provide surprisingly little benefit in most cases.
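The abstract does not name the specific inductive bias methods evaluated. As a hedged illustration of the general idea (building syntactic structure into the pretraining process), the sketch below restricts self-attention so that each token may attend only to itself, its dependency head, and its dependents; the toy parse, dimensions, and masking rule are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: self-attention constrained by a dependency parse,
# so syntactic structure shapes which tokens can mix information.

import numpy as np

def dependency_mask(heads: list[int]) -> np.ndarray:
    """Build a boolean attention mask from a dependency parse.

    heads[i] is the index of token i's head (or i itself for the root).
    Position (i, j) is True if token i may attend to token j.
    """
    n = len(heads)
    mask = np.eye(n, dtype=bool)              # every token attends to itself
    for i, h in enumerate(heads):
        mask[i, h] = True                     # token -> its head
        mask[h, i] = True                     # head -> its dependent
    return mask

def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention restricted by `mask`."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)             # (n, n) similarity scores
    scores = np.where(mask, scores, -1e9)     # block non-syntactic positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                        # syntactically constrained mixing

if __name__ == "__main__":
    # Toy sentence: "the cat sat"; "sat" is the root, "cat" -> "sat", "the" -> "cat".
    heads = [1, 2, 2]
    embeddings = np.random.default_rng(0).normal(size=(3, 8))
    out = masked_self_attention(embeddings, dependency_mask(heads))
    print(out.shape)                          # (3, 8)
```

In a real pretraining setup the mask would come from an automatic parser and would typically be combined with ordinary unconstrained attention heads; the sketch only shows one place where a syntactic prior can enter the architecture.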
Related papers
- Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT.
This study systematically investigates how each resource and its quality affect translation performance, focusing on the Manchu language.
Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
arXiv Detail & Related papers (2025-02-17T14:53:49Z) - Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - Towards Neural No-Resource Language Translation: A Comparative Evaluation of Approaches [0.0]
No-resource languages - those with minimal or no digital representation - pose unique challenges for machine translation (MT).
Unlike low-resource languages, which rely on limited but existent corpora, no-resource languages often have fewer than 100 sentences available for training.
This work explores the problem of no-resource translation through three distinct approaches.
arXiv Detail & Related papers (2024-12-29T21:12:39Z) - Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem [4.830018386227]
This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline.
We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials and parallel corpora.
arXiv Detail & Related papers (2024-06-21T20:02:22Z) - Transferring BERT Capabilities from High-Resource to Low-Resource
Languages Using Vocabulary Matching [1.746529892290768]
This work presents a novel approach to transferring BERT capabilities from high-resource to low-resource languages using vocabulary matching (a rough sketch of this idea appears after this list).
We conduct experiments on the Silesian and Kashubian languages and demonstrate that our approach improves the performance of BERT models even when the target language has minimal training data.
arXiv Detail & Related papers (2024-02-22T09:49:26Z) - The Best of Both Worlds: Combining Human and Machine Translations for
Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z) - Improving Cross-lingual Information Retrieval on Low-Resource Languages
via Optimal Transport Distillation [21.057178077747754]
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating the cross-lingual knowledge from knowledge of query-document matching, OPTICAL only needs bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
arXiv Detail & Related papers (2023-01-29T22:30:36Z) - An Exploration of Data Augmentation Techniques for Improving English to
Tigrinya Translation [21.636157115922693]
An effective method of generating auxiliary data is back-translation of target language sentences.
We present a case study of Tigrinya where we investigate several back-translation methods to generate synthetic source sentences.
arXiv Detail & Related papers (2021-03-31T03:31:09Z) - Emergent Communication Pretraining for Few-Shot Machine Translation [66.48990742411033]
We pretrain neural networks via emergent communication from referential games.
Our key assumption is that grounding communication on images, as a crude approximation of real-world environments, inductively biases the model towards learning natural languages.
arXiv Detail & Related papers (2020-11-02T10:57:53Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in building NER models for low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
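For the vocabulary-matching entry above (Transferring BERT Capabilities from High-Resource to Low-Resource Languages Using Vocabulary Matching), here is a minimal sketch of one way such a transfer could be implemented: target-language tokens that also appear in the source tokenizer's vocabulary keep the pretrained source embedding, while unmatched tokens are initialized randomly. The exact-string matching rule, the initialization scale, and the helper name `transfer_embeddings` are illustrative assumptions, not necessarily that paper's procedure.

```python
# Hedged sketch: initialize a target-language embedding matrix by copying the
# pretrained source embedding for every token string shared by both vocabularies;
# tokens unique to the target vocabulary get a small random vector.

import numpy as np

def transfer_embeddings(
    source_vocab: dict[str, int],   # token string -> row index in source_emb
    source_emb: np.ndarray,         # (|source_vocab|, dim) pretrained embeddings
    target_vocab: dict[str, int],   # token string -> row index in the new matrix
    seed: int = 0,
) -> np.ndarray:
    rng = np.random.default_rng(seed)
    dim = source_emb.shape[1]
    target_emb = rng.normal(scale=0.02, size=(len(target_vocab), dim))
    matched = 0
    for token, tgt_idx in target_vocab.items():
        src_idx = source_vocab.get(token)
        if src_idx is not None:                  # vocabulary match found
            target_emb[tgt_idx] = source_emb[src_idx]
            matched += 1
    print(f"matched {matched}/{len(target_vocab)} target tokens")
    return target_emb

if __name__ == "__main__":
    src_vocab = {"[PAD]": 0, "dom": 1, "##ek": 2, "woda": 3}
    tgt_vocab = {"[PAD]": 0, "dom": 1, "##ik": 2}
    src_emb = np.random.default_rng(1).normal(size=(4, 16))
    new_emb = transfer_embeddings(src_vocab, src_emb, tgt_vocab)
    print(new_emb.shape)                         # (3, 16)
```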
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.