Building Tamil Treebanks
- URL: http://arxiv.org/abs/2409.14657v1
- Date: Mon, 23 Sep 2024 01:58:50 GMT
- Title: Building Tamil Treebanks
- Authors: Kengatharaiyer Sarveswaran
- Abstract summary: Treebanks are important linguistic resources, which are structured and annotated corpora with rich linguistic annotations.
This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Treebanks are important linguistic resources, which are structured and annotated corpora with rich linguistic annotations. These resources are used in Natural Language Processing (NLP) applications, supporting linguistic analyses, and are essential for training and evaluating various computational models. This paper discusses the creation of Tamil treebanks using three distinct approaches: manual annotation, computational grammars, and machine learning techniques. Manual annotation, though time-consuming and requiring linguistic expertise, ensures high-quality and rich syntactic and semantic information. Computational deep grammars, such as Lexical Functional Grammar (LFG), offer deep linguistic analyses but necessitate significant knowledge of the formalism. Machine learning approaches, utilising off-the-shelf frameworks and tools like Stanza, UDPipe, and UUParser, facilitate the automated annotation of large datasets but depend on the availability of quality annotated data, cross-linguistic training resources, and computational power. The paper discusses the challenges encountered in building Tamil treebanks, including issues with Internet data, the need for comprehensive linguistic analysis, and the difficulty of finding skilled annotators. Despite these challenges, the development of Tamil treebanks is essential for advancing linguistic research and improving NLP tools for Tamil.
Related papers
- Tamil Language Computing: the Present and the Future [0.0]
Language computing integrates linguistics, computer science, and cognitive psychology to create meaningful human-computer interactions.
Recent advancements in deep learning have made computers more accessible and capable of independent learning and adaptation.
The paper underscores the importance of building practical applications for languages like Tamil to address everyday communication needs.
arXiv Detail & Related papers (2024-07-11T15:56:02Z)
- Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus [0.9051256541674136]
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus.
It is designed to bridge the technological gap in language learning and machine translation for under-resourced languages.
arXiv Detail & Related papers (2024-07-06T21:23:20Z)
- Sanskrit Knowledge-based Systems: Annotation and Computational Tools [0.12086712057375555]
We address the challenges and opportunities in the development of knowledge systems for Sanskrit.
This research contributes to the preservation, understanding, and utilization of the rich linguistic information embodied in Sanskrit texts.
arXiv Detail & Related papers (2024-06-26T12:00:10Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning [66.79173000135717]
We apply this work to teaching two Indian languages, Kannada and Marathi, which do not have well-developed resources for second language learning.
We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary).
We enlist language educators from schools in North America to perform a manual evaluation; they find that the materials have potential to be used for lesson preparation and learner evaluation.
arXiv Detail & Related papers (2023-10-27T18:17:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning [91.49622922938681]
We present an automatic framework that discovers and visualizes descriptions of different aspects of grammar.
Specifically, we extract descriptions from a natural text corpus that answer questions about morphosyntax and semantics.
We apply this method for teaching the Indian languages, Kannada and Marathi, which, unlike English, do not have well-developed pedagogical resources.
arXiv Detail & Related papers (2022-06-10T14:52:22Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from resource-rich languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Natural Language Processing Advancements By Deep Learning: A Survey [0.755972004983746]
This survey categorizes and addresses the different aspects and applications of NLP that have benefited from deep learning.
It covers core NLP tasks and applications and describes how deep learning methods and models advance these areas.
arXiv Detail & Related papers (2020-03-02T21:32:05Z)