Bootstrapping Techniques for Polysynthetic Morphological Analysis
- URL: http://arxiv.org/abs/2005.00956v1
- Date: Sun, 3 May 2020 00:35:19 GMT
- Title: Bootstrapping Techniques for Polysynthetic Morphological Analysis
- Authors: William Lane and Steven Bird
- Abstract summary: We offer linguistically-informed approaches for bootstrapping a neural morphological analyzer.
We generate data from a finite state transducer to train an encoder-decoder model.
We improve the model by "hallucinating" missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes.
- Score: 9.655349059913888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Polysynthetic languages have exceptionally large and sparse vocabularies,
thanks to the number of morpheme slots and combinations in a word. This
complexity, together with a general scarcity of written data, poses a challenge
to the development of natural language technologies. To address this challenge,
we offer linguistically-informed approaches for bootstrapping a neural
morphological analyzer, and demonstrate its application to Kunwinjku, a
polysynthetic Australian language. We generate data from a finite state
transducer to train an encoder-decoder model. We improve the model by
"hallucinating" missing linguistic structure into the training data, and by
resampling from a Zipf distribution to simulate a more natural distribution of
morphemes. The best model accounts for all instances of reduplication in the
test set and achieves an accuracy of 94.7% overall, a 10 percentage point
improvement over the FST baseline. This process demonstrates the feasibility of
bootstrapping a neural morph analyzer from minimal resources.
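The Zipf resampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how morphemes are ranked or which exponent is used, so the rank order, the exponent `s`, the helper names `zipf_weights` and `zipf_resample`, and the placeholder morpheme labels are all assumptions made for illustration. The idea shown is only that items are drawn with probability proportional to 1/rank**s, so a few items dominate the sample and most remain rare.

```python
import random
from collections import Counter


def zipf_weights(n, s=1.0):
    """Normalized Zipf weights for ranks 1..n (weight proportional to 1/rank**s)."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def zipf_resample(items, k, s=1.0, seed=0):
    """Draw k items with replacement; the item at index i is treated as rank i + 1.

    The ordering of `items` is an assumption: in practice the ranking would
    have to come from some frequency estimate for the language.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.choices(items, weights=zipf_weights(len(items), s), k=k)


# Hypothetical placeholder labels, not real Kunwinjku morphemes.
morphemes = ["root_a", "root_b", "root_c", "root_d", "root_e"]
sample = zipf_resample(morphemes, k=10_000)
counts = Counter(sample)

# Print the observed frequency of each placeholder label, most frequent first.
for label, count in counts.most_common():
    print(label, count)
```

With uniform sampling from a transducer every item would appear about equally often; resampling this way skews the generated training set toward the top-ranked items, which is the "more natural distribution" the abstract refers to.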
Related papers
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
arXiv Detail & Related papers (2022-12-19T21:34:00Z)
- Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization [0.0]
We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference corpus.
To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks.
Our combination method enhances our model's resistance to perturbation testing, enabling it to continuously outperform the pre-trained baseline.
arXiv Detail & Related papers (2022-12-16T23:37:44Z)
- Syntax-informed Question Answering with Heterogeneous Graph Transformer [2.139714421848487]
We present a linguistics-informed question answering approach that extends and fine-tunes a pre-trained neural language model.
We illustrate the approach by the addition of syntactic information in the form of dependency and constituency graphic structures connecting tokens and virtual tokens.
arXiv Detail & Related papers (2022-04-01T07:48:03Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Exploiting Language Model for Efficient Linguistic Steganalysis: An Empirical Study [23.311007481830647]
We present two methods for efficient linguistic steganalysis.
One is to pre-train a language model based on RNN, and the other is to pre-train a sequence autoencoder.
arXiv Detail & Related papers (2021-07-26T12:37:18Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
- Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data [37.542036032277466]
We introduce a technique for "simulation-to-real" transfer in language understanding problems.
Our approach matches or outperforms state-of-the-art models trained on natural language data in several domains.
arXiv Detail & Related papers (2020-04-28T16:41:00Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)