Automatic Extraction of Bengali Root Verbs using Paninian Grammar
- URL: http://arxiv.org/abs/2004.00089v1
- Date: Tue, 31 Mar 2020 20:22:10 GMT
- Title: Automatic Extraction of Bengali Root Verbs using Paninian Grammar
- Authors: Arijit Das, Tapas Halder and Diganta Saha
- Abstract summary: The proposed system finds the root forms of verbs based on their tense, person and morphological inflections.
It achieves an accuracy of 98%, as verified by a linguistic expert.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this research work, we propose an algorithm based on a supervised
learning methodology to extract the root forms of Bengali verbs using the
grammatical rules proposed by Panini [1] in the Ashtadhyayi. This methodology can
be applied to languages that are derived from Sanskrit. The proposed
system uses the tense, person and morphological inflections
of the verbs to find their root forms. The work was carried out in two
phases: first, the surface-level (inflected) forms of the verbs were
classified into a number of groups of similar tense and person,
using a standard pattern available in the Bengali language.
Next, a set of rules was applied to extract the root form from the
surface-level forms of a verb. The system was tested on 10000 verbs
collected from the Bengali text corpus developed in the TDIL project of the
Govt. of India. It achieved an accuracy of 98%, as
verified by a linguistic expert. Root verb identification is a key step in
semantic search, multi-sentence search query processing, language
understanding, word sense disambiguation, sentence classification, etc.
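The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix table, the tense and person labels, and the names `SuffixRule`, `classify` and `extract_root` are all hypothetical, chosen only to show the idea of first grouping an inflected form by its ending and then stripping that ending. Real Bengali verbs also show stem changes that this sketch ignores.

```python
from typing import NamedTuple, Optional


class SuffixRule(NamedTuple):
    suffix: str   # inflectional ending found on the surface form
    tense: str    # illustrative tense label for the group
    person: str   # illustrative person label for the group


# Hypothetical, deliberately tiny rule table (an assumption, not from the paper).
RULES = [
    SuffixRule("ছি", "present continuous", "first"),
    SuffixRule("ছে", "present continuous", "third"),
    SuffixRule("লাম", "simple past", "first"),
    SuffixRule("বে", "simple future", "third"),
]


def classify(surface: str) -> Optional[SuffixRule]:
    """Phase 1: assign the surface form to a tense/person group.

    The longest matching suffix wins; a form that consists only of the
    suffix is not matched, so the remaining stem is never empty.
    """
    candidates = [
        rule for rule in RULES
        if surface.endswith(rule.suffix) and len(surface) > len(rule.suffix)
    ]
    return max(candidates, key=lambda rule: len(rule.suffix), default=None)


def extract_root(surface: str) -> str:
    """Phase 2: strip the group's suffix to obtain a candidate root."""
    rule = classify(surface)
    if rule is None:
        return surface  # no rule applies; return the form unchanged
    return surface[: -len(rule.suffix)]


if __name__ == "__main__":
    for form in ["করছি", "করছে", "করলাম", "করবে"]:
        print(form, "->", extract_root(form), classify(form))
```

A table-driven design keeps the grouping step and the stripping step separate, so new tense and person groups can be added without touching the extraction logic.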
Related papers
- BanLemma: A Word Formation Dependent Rule and Dictionary Based Bangla
Lemmatizer [3.1742013359102175]
We propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer for Bangla.
Our system aims to lemmatize words based on their parts of speech class within a given sentence.
The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained.
arXiv Detail & Related papers (2023-11-06T13:02:07Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2
Language Learning [66.79173000135717]
We apply this work to teaching two Indian languages, Kannada and Marathi, which do not have well-developed resources for second language learning.
We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary).
Language educators from schools in North America perform a manual evaluation and find that the materials have potential to be used for lesson preparation and learner evaluation.
arXiv Detail & Related papers (2023-10-27T18:17:29Z)
- Plagiarism Detection in the Bengali Language: A Text Similarity-Based
Approach [0.866842899233181]
Plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India.
We collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our corpus.
Our experimental results show an average accuracy between 72.10% and 79.89% in text extraction using OCR.
We built a web application for end users and successfully tested it for plagiarism detection in Bengali texts.
arXiv Detail & Related papers (2022-03-25T03:11:00Z)
- Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report improved performance of up to 26%.
arXiv Detail & Related papers (2021-12-30T16:46:28Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and
language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
- HinFlair: pre-trained contextual string embeddings for pos tagging and
text classification in the Hindi language [0.0]
HinFlair is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus.
Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings on downstream tasks such as text classification and POS tagging.
arXiv Detail & Related papers (2021-01-18T09:23:35Z)
- Language Identification of Devanagari Poems [0.0]
This paper proposes a procedure for automatic language identification of poems for the poem analysis task.
It covers 10 Devanagari-based languages of India, i.e., Angika, Awadhi, Braj, Bhojpuri, Chhattisgarhi, Garhwali, Haryanvi, Hindi, Magahi, and Maithili.
arXiv Detail & Related papers (2020-12-30T03:36:18Z)
- Method of noun phrase detection in Ukrainian texts [0.0]
Research on finding noun phrases within Ukrainian texts is still at an early stage.
A complex method of noun phrase detection in Ukrainian texts, utilizing Universal Dependencies means and a named-entity recognition model, is suggested.
arXiv Detail & Related papers (2020-10-22T09:20:24Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a
Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)