Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines
- URL: http://arxiv.org/abs/2301.12463v3
- Date: Fri, 11 Jul 2025 12:49:25 GMT
- Title: Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines
- Authors: Shreekanth M Prabhu, Abhisek Midya
- Abstract summary: We propose an Ecosystem Model for Linguistic Development with Sanskrit at the core. We represent words across languages as state transitions on the phonetic map and construct corresponding Morphological Finite Automata.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The study of spoken languages comprises phonology, morphology, and grammar. Languages can be classified as root languages, inflectional languages, and stem languages. In addition, languages continually change over time and space by picking up isoglosses as speakers move from region to/through region. All these factors lead to the formation of vocabulary, which has commonality/similarity across languages as well as distinct and subtle differences among them. Comparison of vocabularies across languages and detailed analysis have led to the hypothesis of language families. In particular, in the view of Western linguists, Vedic Sanskrit is a daughter language, part of the Indo-Iranian branch of the Indo-European language family, and Dravidian languages belong to an entirely different family. These and similar conclusions are reexamined in this paper. Based on our study and analysis, we propose an Ecosystem Model for Linguistic Development with Sanskrit at the core, in place of the widely accepted family-tree model. To that end, we leverage the Paninian system of sounds to construct a phonetic map. We then represent words across languages as state transitions on the phonetic map and construct corresponding Morphological Finite Automata (MFA) that accept groups of words. Regardless of whether the contribution of this paper is significant or minor, it is an important step in challenging policy-driven research that has plagued this field.
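The two constructions named in the abstract, a phonetic map over Paninian sound classes and Morphological Finite Automata whose transitions over that map accept groups of words, can be sketched in a few lines. The sound-class table, the example forms, and the acceptance criterion below are simplified assumptions for illustration only, not the paper's actual construction.

```python
# A minimal sketch (not the authors' implementation) of the abstract's idea:
# a phonetic map built from Paninian-style sound classes, and a Morphological
# Finite Automaton (MFA) whose transitions over that map accept a group of
# related word forms. Class table, example words, and acceptance criterion
# are simplified assumptions for exposition.

PHONETIC_CLASSES = {
    # Hypothetical, heavily simplified grouping of Latin-transliterated sounds.
    "k": "velar", "g": "velar",
    "c": "palatal", "j": "palatal",
    "t": "dental", "d": "dental",
    "p": "labial", "b": "labial",
    "a": "vowel", "i": "vowel", "u": "vowel", "e": "vowel", "o": "vowel",
    "y": "semivowel", "r": "semivowel", "l": "semivowel", "v": "semivowel",
    "s": "sibilant", "h": "aspirate", "m": "nasal", "n": "nasal",
}

def to_class_path(word: str) -> tuple:
    """Trace a word as a sequence of sound-class states on the phonetic map."""
    return tuple(PHONETIC_CLASSES.get(ch, "other") for ch in word.lower())

class MorphologicalFA:
    """A finite-state acceptor whose transitions are sound classes."""

    def __init__(self, word_group):
        # Build a trie-shaped DFA: each node is a state, each sound class
        # labels a transition; words in the group define accepting paths.
        self.start = {}
        for word in word_group:
            state = self.start
            for cls in to_class_path(word):
                state = state.setdefault(cls, {})
            state["<accept>"] = True

    def accepts(self, word: str) -> bool:
        state = self.start
        for cls in to_class_path(word):
            if cls not in state:
                return False          # no such transition: reject
            state = state[cls]
        return state.get("<accept>", False)

# Usage: forms that differ only within a sound class trace the same path.
mfa = MorphologicalFA(["pitar"])      # illustrative cognate-like form
print(mfa.accepts("pidar"))           # True  (t and d are both dental here)
print(mfa.accepts("matar"))           # False (nasal onset breaks the path)
```

The point of the sketch is only the shape of the construction: sounds are grouped into articulatory classes, a word becomes a path of class transitions on the map, and an automaton built from a handful of forms accepts the wider group of words sharing those transitions.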
Related papers
- Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent [22.63155507847401]
We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don't. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves.
arXiv Detail & Related papers (2025-05-12T15:25:17Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
arXiv Detail & Related papers (2024-02-05T11:32:13Z) - Patterns of Persistence and Diffusibility across the World's Languages [3.7055269158186874]
Colexification is a type of similarity where a single lexical form is used to convey multiple meanings.
We shed light on the linguistic causes of cross-lingual similarity in colexification and phonology.
We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages.
arXiv Detail & Related papers (2024-01-03T12:05:38Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are an essential component of a Text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation related task, typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z) - From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought [124.40905824051079]
We propose rational meaning construction, a computational framework for language-informed thinking.
We frame linguistic meaning as a context-sensitive mapping from natural language into a probabilistic language of thought.
We show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings.
We extend our framework to integrate cognitively-motivated symbolic modules.
arXiv Detail & Related papers (2023-06-22T05:14:00Z) - Improve Bilingual TTS Using Dynamic Language and Phonology Embedding [10.244215079409797]
This paper builds a Mandarin-English TTS system to acquire more standard spoken English speech from a monolingual Chinese speaker.
We specially design an embedding strength modulator to capture the dynamic strength of language and phonology.
arXiv Detail & Related papers (2022-12-07T03:46:18Z) - AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z) - A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z) - Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification [16.369477141866405]
We present a neural model for Slavic language identification in speech signals.
We analyze its emergent representations to investigate whether they reflect objective measures of language relatedness.
arXiv Detail & Related papers (2020-10-22T18:18:19Z) - Deciphering Undersegmented Ancient Scripts Using Phonetic Prior [31.707254394215283]
Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges.
We propose a model that handles both of these challenges by building on rich linguistic constraints.
We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian).
arXiv Detail & Related papers (2020-10-21T15:03:52Z) - Neural Polysynthetic Language Modelling [15.257624461339867]
In high-resource languages, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types.
This assumes that there are limited inflections per root, and that the majority will appear in a large enough corpus.
We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages.
arXiv Detail & Related papers (2020-05-11T22:57:04Z) - Synchronous Bidirectional Learning for Multilingual Lip Reading [99.14744013265594]
Lip movements of all languages share similar patterns due to the common structures of human organs.
Phonemes are more closely related with the lip movements than the alphabet letters.
A novel SBL block is proposed to learn the rules for each language in a fill-in-the-blank way.
arXiv Detail & Related papers (2020-05-08T04:19:57Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z) - Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.