Information-Theoretic Characterization of Vowel Harmony: A
Cross-Linguistic Study on Word Lists
- URL: http://arxiv.org/abs/2308.04885v1
- Date: Wed, 9 Aug 2023 11:32:16 GMT
- Title: Information-Theoretic Characterization of Vowel Harmony: A
Cross-Linguistic Study on Word Lists
- Authors: Julius Steuer and Badr Abdullah and Johann-Mattis List and Dietrich
Klakow
- Abstract summary: We define an information-theoretic measure of harmonicity based on predictability of vowels in a natural language lexicon.
We estimate this harmonicity using phoneme-level language models (PLMs).
Our work demonstrates that word lists are a valuable resource for typological research.
- Score: 18.138642719651994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a cross-linguistic study that aims to quantify vowel harmony using
data-driven computational modeling. Concretely, we define an
information-theoretic measure of harmonicity based on the predictability of
vowels in a natural language lexicon, which we estimate using phoneme-level
language models (PLMs). Prior quantitative studies have relied heavily on
inflected word-forms in the analysis of vowel harmony. We instead train our
models using cross-linguistically comparable lemma forms with little or no
inflection, which enables us to cover more under-studied languages. Training
data for our PLMs consists of word lists with a maximum of 1000 entries per
language. Despite the fact that the data we employ are substantially smaller
than previously used corpora, our experiments demonstrate the neural PLMs
capture vowel harmony patterns in a set of languages that exhibit this
phenomenon. Our work also demonstrates that word lists are a valuable resource
for typological research, and offers new possibilities for future studies on
low-resource, under-studied languages.
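The intuition behind the harmonicity measure can be illustrated with a toy stand-in: the snippet below scores vowel predictability as the average bigram surprisal over the vowel tier of each word, using an add-one-smoothed count model in place of the paper's neural PLMs. The vowel inventory, the toy word lists, and the bigram model are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

# Illustrative vowel inventory, not the paper's cross-linguistic one.
VOWELS = set("aeiouyøæɑɛɔœʏ")

def vowel_surprisal(word_list):
    """Average surprisal (in bits) of each vowel given the preceding vowel
    in the same word, under an add-one-smoothed bigram model trained on the
    vowel tier. Lower surprisal = more predictable vowels, which is the
    intuition behind a harmonicity measure."""
    bigrams, unigrams = Counter(), Counter()
    for word in word_list:
        tier = ["<s>"] + [p for p in word if p in VOWELS]
        for prev, cur in zip(tier, tier[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    vocab = {v for _, v in bigrams} | {"<s>"}
    total, n = 0.0, 0
    for word in word_list:
        tier = ["<s>"] + [p for p in word if p in VOWELS]
        for prev, cur in zip(tier, tier[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
            total += -math.log2(p)
            n += 1
    return total / n if n else float("nan")

# Toy comparison: a "harmonic" list whose vowels agree within each word
# vs. a list with freely mixing vowels.
harmonic = ["kalat", "kelet", "kolot", "kalap", "kelep"]
mixed = ["kalet", "kelot", "kolat", "kalop", "kelap"]
print(vowel_surprisal(harmonic) < vowel_surprisal(mixed))  # → True
```

The harmonic list yields lower average surprisal because each vowel is largely determined by the preceding one; a neural PLM plays the same role at scale, with the measure estimated over a 1000-entry word list per language.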
Related papers
- Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN [2.495922096144971]
We investigate the potential of the Featural InfoWaveGAN model to learn iterative long-distance vowel harmony using raw speech data.
We focus on Assamese, a language known for its phonologically regressive and word-bound vowel harmony.
We demonstrate that the model is adept at grasping the intricacies of Assamese phonotactics.
arXiv Detail & Related papers (2024-07-09T05:01:13Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
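The information-theoretic query policies described above can be sketched in miniature: below, candidate phonotactic grammars are sets of forbidden bigrams, and the policy selects the query whose yes/no answer maximizes the expected reduction in posterior entropy. The grammars, candidate strings, and binary informant are illustrative assumptions, not the paper's model.

```python
import math

# Hypothetical toy setup: each candidate "grammar" forbids a set of bigrams,
# and the informant answers whether a queried string is well-formed.
GRAMMARS = [set(), {"ab"}, {"ba"}, {"ab", "ba"}]
CANDIDATES = ["ab", "ba", "aa", "bb", "abba", "baab"]

def accepts(grammar, s):
    """A string is well-formed iff it contains no forbidden bigram."""
    return not any(s[i:i + 2] in grammar for i in range(len(s) - 1))

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def best_query(posterior):
    """Select the candidate string whose answer maximizes the expected
    entropy reduction over the posterior on candidate grammars."""
    h = entropy(posterior)
    best, best_gain = None, -1.0
    for s in CANDIDATES:
        p_yes = sum(p for g, p in zip(GRAMMARS, posterior) if accepts(g, s))
        gain = h
        for wants_yes, p_a in ((True, p_yes), (False, 1.0 - p_yes)):
            if p_a == 0:
                continue  # impossible answer contributes nothing
            cond = [p * (accepts(g, s) == wants_yes) / p_a
                    for g, p in zip(GRAMMARS, posterior)]
            gain -= p_a * entropy(cond)
        if gain > best_gain:
            best, best_gain = s, gain
    return best, best_gain

print(best_query([0.25, 0.25, 0.25, 0.25]))  # → ('ab', 1.0)
```

Under a uniform prior, querying "ab" (or "ba") splits the four hypotheses evenly and gains one full bit, whereas strings accepted by every grammar are uninformative; this is the sense in which an informant query can match or beat fully supervised sampling.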
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- PhonologyBench: Evaluating Phonological Skills of Large Language Models [57.80997670335227]
Phonology, the study of the sound structure and pronunciation rules of language, is a critical yet often overlooked component of Large Language Model (LLM) research.
We present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs.
We observe significant gaps of 17% and 45% on Rhyme Word Generation and Syllable Counting, respectively, compared to humans.
arXiv Detail & Related papers (2024-04-03T04:53:14Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
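A naive baseline for the decompounding task can be sketched as a recursive vocabulary lookup; the vocabulary, the `min_len` threshold, and the first-match search strategy below are illustrative assumptions, not the CompoundPiece methodology.

```python
def decompound(word, vocab, min_len=3):
    """Recursively split `word` into constituents found in `vocab`.
    Returns a list of parts, or None if no full segmentation exists.
    First-match greedy search; a toy baseline, not a trained model."""
    if word in vocab:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab:
            rest = decompound(tail, vocab, min_len)
            if rest is not None:
                return [head] + rest
    return None

# Illustrative German vocabulary and compound.
vocab = {"haus", "tür", "schlüssel", "bund"}
print(decompound("haustürschlüsselbund", vocab))
# → ['haus', 'tür', 'schlüssel', 'bund']
```

A dictionary baseline of this kind fails on linking elements, spelling changes at constituent boundaries, and out-of-vocabulary constituents, which is the gap dedicated trained models aim to close.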
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings [19.195728241989702]
We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability.
arXiv Detail & Related papers (2022-09-14T13:33:04Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words which encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.