Linguistic Classification using Instance-Based Learning
- URL: http://arxiv.org/abs/2012.07512v1
- Date: Wed, 2 Dec 2020 04:12:10 GMT
- Title: Linguistic Classification using Instance-Based Learning
- Authors: Priya S. Nayak, Rhythm Girdhar, Shreekanth M. Prabhu
- Abstract summary: We take a contrarian approach and question the tree-based model, which is rather restrictive.
For example, the affinity that Sanskrit independently has with languages across the Indo-European family is better illustrated using a network model.
The same holds for the inter-relationships among languages in India, which are better discovered than assumed.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditionally, linguists have organized the languages of the world into
language families modelled as trees. In this work we take a contrarian approach and
question the tree-based model, which is rather restrictive. For example, the
affinity that Sanskrit independently has with languages across the Indo-European
family is better illustrated using a network model. The same holds for the
inter-relationships among languages in India, which are better discovered than
assumed. To enable such discovery, we use instance-based learning techniques
to assign language labels to words. We vocalize each word and then classify it
using our custom linguistic distance metric, computed between the word and
training sets containing language labels. We construct the training sets by
forming word clusters and assigning a language and category label to each
cluster. Further, we use clustering coefficients as a quality metric
for our research. We believe our work has the potential to usher in a new era
in linguistics. We have limited this work to important languages in India.
This work can be further strengthened by applying AdaBoost for classification
coupled with structural equivalence concepts of social network analysis.
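The instance-based labelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain Levenshtein edit distance stands in for their custom linguistic distance metric, the vocalization step is skipped, and the word strings, the labels `A` and `B`, and the names `edit_distance`, `classify`, and `TRAINING` are all hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; an assumed stand-in for the paper's custom metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def classify(word: str, training: list[tuple[str, str]], k: int = 3) -> str:
    """Label a word by majority vote among its k nearest training words."""
    neighbours = sorted(training, key=lambda item: edit_distance(word, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: invented strings with placeholder language labels.
TRAINING = [
    ("pani", "A"), ("paani", "A"), ("pana", "A"),
    ("niru", "B"), ("neeru", "B"), ("nira", "B"),
]

if __name__ == "__main__":
    for query in ("panee", "nero"):
        print(query, "->", classify(query, TRAINING))
```

Because instance-based learning keeps the labelled examples themselves rather than fitting a parametric model, swapping in a different distance function changes the classifier without any retraining.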
Related papers
- A Computational Model for the Assessment of Mutual Intelligibility Among
Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
arXiv Detail & Related papers (2024-02-05T11:32:13Z)
- Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z)
- Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report improved performance of up to 26%.
arXiv Detail & Related papers (2021-12-30T16:46:28Z)
- A Data Bootstrapping Recipe for Low Resource Multilingual Relation
Classification [38.83366564843953]
IndoRE is a dataset with 21K entity and relation tagged gold sentences in three Indian languages, plus English.
We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information.
We study the accuracy-efficiency tradeoff between expensive gold instances and translated and aligned 'silver' instances.
arXiv Detail & Related papers (2021-10-18T18:40:46Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Establishing Interlingua in Multilingual Language Models [0.0]
We show that different languages do converge to a shared space in large multilingual language models.
We extend our analysis to 28 diverse languages and find that the interlingual space exhibits a particular structure similar to the linguistic relatedness of languages.
arXiv Detail & Related papers (2021-09-02T20:53:14Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text
Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Markov Chain Monte-Carlo Phylogenetic Inference Construction in
Computational Historical Linguistics [0.0]
More and more languages in the world are under study nowadays; as a result, the traditional approach to historical linguistics is facing some challenges.
In this paper, I use a computational method to cluster the languages and a Markov Chain Monte Carlo (MCMC) method to build the language typology relationship tree.
arXiv Detail & Related papers (2020-02-22T06:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.