Urdu Morphology, Orthography and Lexicon Extraction
- URL: http://arxiv.org/abs/2204.03071v1
- Date: Wed, 6 Apr 2022 20:14:01 GMT
- Title: Urdu Morphology, Orthography and Lexicon Extraction
- Authors: Muhammad Humayoun and Harald Hammarström and Aarne Ranta
- Abstract summary: This paper describes an implementation of the Urdu language as a software API.
We deal with orthography, morphology and the extraction of the lexicon.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Urdu is a challenging language for two reasons: first, its
Perso-Arabic script, and second, its morphological system, which incorporates
grammatical forms and vocabulary of Arabic, Persian and the native languages
of South Asia. This
paper describes an implementation of the Urdu language as a software API, and
we deal with orthography, morphology and the extraction of the lexicon. The
morphology is implemented in a toolkit called Functional Morphology (Forsberg &
Ranta, 2004), which is based on the idea of treating grammars as software
libraries. Therefore this implementation could be reused in applications such
as intelligent search of keywords, language training and infrastructure for
syntax. We also present an implementation of a small part of Urdu syntax to
demonstrate this reusability.
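The idea of a morphology that applications can call as a library can be illustrated with a minimal Python sketch. This is not the paper's code: Functional Morphology is a separate toolkit, and the names `masc_a_noun` and `analyse`, the rough ASCII romanization, and the two-word lexicon are all assumptions made for illustration. The sketch covers a single paradigm, marked masculine nouns ending in -a, such as larka ("boy").

```python
from typing import Dict, List, Tuple

Slot = Tuple[str, str]  # (case, number)


def masc_a_noun(lemma: str) -> Dict[Slot, str]:
    """Hypothetical paradigm function for marked masculine nouns in -a.

    Takes a romanized lemma such as "larka" and returns its inflection
    table, keyed by (case, number).
    """
    if not lemma.endswith("a"):
        raise ValueError(f"expected a lemma ending in 'a', got {lemma!r}")
    stem = lemma[:-1]
    return {
        ("direct", "sg"): stem + "a",
        ("direct", "pl"): stem + "e",
        ("oblique", "sg"): stem + "e",
        ("oblique", "pl"): stem + "on",
        ("vocative", "sg"): stem + "e",
        ("vocative", "pl"): stem + "o",
    }


def analyse(word: str, lexicon: List[str]) -> List[Tuple[str, str, str]]:
    """Return every (lemma, case, number) whose inflected form equals word."""
    return [
        (lemma, case, number)
        for lemma in lexicon
        for (case, number), form in masc_a_noun(lemma).items()
        if form == word
    ]


if __name__ == "__main__":
    # Generation: print the full table for one lemma.
    for slot, form in masc_a_noun("larka").items():
        print(slot, form)
    # Analysis: look up an inflected form in a toy lexicon.
    print(analyse("larke", ["larka", "kamra"]))
```

Because the paradigm is an ordinary function, the same table serves both generation and analysis, which is the kind of reuse the abstract points to for keyword search and language training.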
Related papers
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Morphology and Syntax of the Tamil Language [0.0]
The paper highlights the complexity and richness of Tamil in terms of its morphological and syntactic features.
The description has already been put to use: a rule-based morphological analyser cum generator and a computational grammar for Tamil have been developed based on this paper.
arXiv Detail & Related papers (2024-01-16T13:52:25Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning [66.79173000135717]
We apply this work to teaching two Indian languages, Kannada and Marathi, which do not have well-developed resources for second language learning.
We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary).
We enlist the help of language educators from schools in North America to perform a manual evaluation, who find the materials have potential to be used for their lesson preparation and learner evaluation.
arXiv Detail & Related papers (2023-10-27T18:17:29Z)
- PALI: A Language Identification Benchmark for Perso-Arabic Scripts [30.99179028187252]
This paper sheds light on the challenges of detecting languages using Perso-Arabic scripts.
We use a set of supervised techniques to classify sentences into their languages.
We also propose a hierarchical model that targets clusters of languages that are more often confused.
arXiv Detail & Related papers (2023-04-03T19:40:14Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z)
- Morpheme Boundary Detection & Grammatical Feature Prediction for Gujarati : Dataset & Model [0.0]
We have used a Bi-Directional LSTM based approach to perform morpheme boundary detection and grammatical feature tagging.
This is the first dataset and morph analyzer model for the Gujarati language which performs both grammatical feature tagging and morpheme boundary detection tasks.
arXiv Detail & Related papers (2021-12-18T06:58:36Z)
- Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu [0.0]
This paper builds a corpus for Urdu by scraping and integrating data from various sources.
We modify fasttext embeddings and N-Grams models to enable training them on our built corpus.
We have used these trained embeddings for a word similarity task and compared the results with existing techniques.
arXiv Detail & Related papers (2021-02-22T12:56:26Z)
- HinFlair: pre-trained contextual string embeddings for pos tagging and text classification in the Hindi language [0.0]
HinFlair is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus.
Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings for downstream tasks like text classification and pos tagging.
arXiv Detail & Related papers (2021-01-18T09:23:35Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.