Validation and Normalization of DCS corpus using Sanskrit Heritage tools
to build a tagged Gold Corpus
- URL: http://arxiv.org/abs/2005.06545v1
- Date: Wed, 13 May 2020 19:23:43 GMT
- Title: Validation and Normalization of DCS corpus using Sanskrit Heritage tools
to build a tagged Gold Corpus
- Authors: Sriram Krishnan and Amba Kulkarni and Gérard Huet
- Abstract summary: The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging.
The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Digital Corpus of Sanskrit records around 650,000 sentences along with
their morphological and lexical tagging. But inconsistencies in morphological
analysis, and in providing crucial information like the segmented word, create
a need for standardization and validation of this corpus. Automating the
validation process requires efficient analyzers which also provide the missing
information. The Sanskrit Heritage Engine's Reader produces all possible
segmentations with morphological and lexical analyses. Aligning these systems
would help us in recording the linguistic differences, which can be used to
update these systems to produce standardized results and will also provide a
Gold corpus tagged with complete morphological and lexical information along
with the segmented words. Krishna et al. (2017) aligned 115,000 sentences,
considering some of the linguistic differences. As both these systems have
evolved significantly, the alignment is done again considering all the
remaining linguistic differences between these systems. This paper describes
the modified alignment process in detail and records the additional linguistic
differences observed.
Reference: Amrith Krishna, Pavankumar Satuluri, and Pawan Goyal. 2017. A
dataset for Sanskrit word segmentation. In Proceedings of the Joint SIGHUM
Workshop on Computational Linguistics for Cultural Heritage, Social Sciences,
Humanities and Literature, pages 105-114. Association for Computational
Linguistics, August.
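The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes each Heritage Reader candidate is a list of (segmented word, lemma) pairs and that a DCS sentence is reduced to its lemma sequence. The function name `pick_matching_segmentation` and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical token shape: (segmented word form, lemma).
Token = Tuple[str, str]


def pick_matching_segmentation(
    dcs_lemmas: List[str],
    candidates: List[List[Token]],
) -> Optional[List[Token]]:
    """Return the first candidate whose lemma sequence equals the DCS lemmas.

    Returns None when no candidate matches, which a real pipeline would
    log as a linguistic difference between the two systems.
    """
    for candidate in candidates:
        if [lemma for _, lemma in candidate] == dcs_lemmas:
            return candidate
    return None


# Illustrative data only; not taken from DCS or the Heritage Reader.
candidates = [
    [("rāmogacchati", "rāmogacchati")],          # unsegmented reading
    [("rāmaḥ", "rāma"), ("gacchati", "gam")],    # segmented reading
]
match = pick_matching_segmentation(["rāma", "gam"], candidates)
print(match)
```

Exact lemma equality is the simplest possible matching criterion; the paper's alignment also has to reconcile differences in how the two systems analyse the same word, which this sketch does not attempt.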
Related papers
- Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Developing an Informal-Formal Persian Corpus [0.0]
We build a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level.
The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs.
arXiv Detail & Related papers (2023-08-10T04:57:34Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- A Benchmark Corpus and Neural Approach for Sanskrit Derivative Nouns Analysis [0.755972004983746]
This paper presents the first benchmark corpus of Sanskrit Pratyaya (suffix) and inflectional words (padas) formed due to suffixes.
In this work, we prepared a Sanskrit suffix benchmark corpus called Pratyaya-Kosh to evaluate the performance of tools.
We also present our own neural approach for derivative nouns analysis while evaluating the same on most prominent Sanskrit Morphological Analysis tools.
arXiv Detail & Related papers (2020-10-24T17:22:44Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus contains aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
- Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications [0.6649753747542209]
Bhojpuri, Magahi, and Maithili are low-resource languages of the Purvanchal region of India.
We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels.
The results were compared with a standard Hindi corpus.
arXiv Detail & Related papers (2020-04-29T03:58:55Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects the posterior documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)