Language Identification of Devanagari Poems
- URL: http://arxiv.org/abs/2012.15023v1
- Date: Wed, 30 Dec 2020 03:36:18 GMT
- Title: Language Identification of Devanagari Poems
- Authors: Priyankit Acharya, Aditya Ku. Pathak, Rakesh Ch. Balabantaray, and
Anil Ku. Singh
- Abstract summary: This paper proposes a procedure for automatic language identification of poems, intended for the poem analysis task.
It covers 10 Devanagari-based languages of India: Angika, Awadhi, Braj, Bhojpuri, Chhattisgarhi, Garhwali, Haryanvi, Hindi, Magahi, and Maithili.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language Identification is a very important part of several text processing
pipelines. Extensive research has been done in this field. This paper proposes
a procedure for automatic language identification of poems for poem analysis
task, consisting of 10 Devanagari based languages of India i.e. Angika, Awadhi,
Braj, Bhojpuri, Chhattisgarhi, Garhwali, Haryanvi, Hindi, Magahi, and Maithili.
We collated corpora of poems of varying length and studied the similarity of
poems among the 10 languages at the lexical level. Finally, various language
identification systems based on supervised machine learning and deep learning
techniques are applied and evaluated.
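The abstract mentions studying similarity among the languages at the lexical level but does not say which measure was used. As a rough illustration only, the sketch below assumes one common choice, Jaccard overlap between the vocabularies of two corpora. The function names, the whitespace tokenization, and the toy lines are all hypothetical and are not taken from the paper or its corpora.

```python
def vocabulary(poems):
    """Return the set of whitespace-separated tokens across a list of poems."""
    return {token for poem in poems for token in poem.split()}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy placeholder lines (assumption): NOT drawn from the paper's corpora.
corpora = {
    "lang_a": ["राम घर गया", "घर सुंदर है"],
    "lang_b": ["राम घर गइल", "घर सुंदर बा"],
}

vocab = {lang: vocabulary(poems) for lang, poems in corpora.items()}
similarity = jaccard(vocab["lang_a"], vocab["lang_b"])
print(f"lexical overlap: {similarity:.3f}")
```

A high overlap between two closely related languages would suggest that word-level features alone may struggle to separate them, which is one reason a lexical similarity study might precede classifier design.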
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z) - Urdu Poetry Generated by Using Deep Learning Techniques [1.52292571922932]
This study provides Urdu poetry generated using different deep-learning techniques and algorithms.
The data was collected through the Rekhta website, containing 1341 text files with several couplets.
arXiv Detail & Related papers (2023-09-25T15:44:24Z) - Multimodal Modeling For Spoken Language Identification [57.94119986116947]
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance.
We propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification.
arXiv Detail & Related papers (2023-09-19T12:21:39Z) - Aesthetics of Sanskrit Poetry from the Perspective of Computational
Linguistics: A Case Study Analysis on Siksastaka [11.950202012146498]
This article explores the intersection of Sanskrit poetry and computational linguistics.
We propose a roadmap of an interpretable framework to analyze and classify the qualities and characteristics of fine Sanskrit poetry.
We provide a deep analysis of Siksastaka, a Sanskrit poem, from the perspective of 6 prominent kavyashastra schools.
arXiv Detail & Related papers (2023-08-14T11:26:25Z) - ALBERTI, a Multilingual Domain Specific Language Model for Poetry
Analysis [0.0]
We present Alberti, the first multilingual pre-trained large language model for poetry.
We further trained multilingual BERT on a corpus of over 12 million verses from 12 languages.
Alberti achieves state-of-the-art results for German when compared to rule-based systems.
arXiv Detail & Related papers (2023-07-03T22:50:53Z) - CCPM: A Chinese Classical Poetry Matching Dataset [50.90794811956129]
We propose a novel task to assess a model's semantic understanding of poetry by poem matching.
This task requires the model to select one line of Chinese classical poetry among four candidates according to the modern Chinese translation of a line of poetry.
To construct this dataset, we first obtain a set of parallel data of Chinese classical poetry and modern Chinese translation.
arXiv Detail & Related papers (2021-06-03T16:49:03Z) - Multilingual and code-switching ASR challenges for low resource Indian
languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z) - Anubhuti -- An annotated dataset for emotional analysis of Bengali short
stories [2.3424047967193826]
Anubhuti is the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories.
We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement.
We have verified the performance of our dataset with baseline Machine Learning and a Deep Learning model for emotion classification.
arXiv Detail & Related papers (2020-10-06T22:33:58Z) - A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z) - Automatic Extraction of Bengali Root Verbs using Paninian Grammar [0.0]
The proposed system has been developed based on tense, person and morphological inflections of the verbs to find their root forms.
The output achieved 98% accuracy, as verified by a linguistic expert.
arXiv Detail & Related papers (2020-03-31T20:22:10Z)