Resources for Turkish Natural Language Processing: A critical survey
- URL: http://arxiv.org/abs/2204.05042v1
- Date: Mon, 11 Apr 2022 12:23:07 GMT
- Title: Resources for Turkish Natural Language Processing: A critical survey
- Authors: Çağrı Çöltekin, A. Seza Doğruöz, Özlem Çetinoğlu
- Abstract summary: We review a broad range of resources, focusing on the ones that are publicly available.
We present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a comprehensive survey of corpora and lexical resources
available for Turkish. We review a broad range of resources, focusing on the
ones that are publicly available. In addition to providing information about
the available linguistic resources, we present a set of recommendations, and
identify gaps in the data available for conducting research and building
applications in Turkish Linguistics and Natural Language Processing.
Related papers
- WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
- Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
arXiv Detail & Related papers (2024-07-06T08:58:26Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- LLMs Are Few-Shot In-Context Low-Resource Language Learners [59.74451570590808]
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages.
We extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.
Our study demonstrates the significance of few-shot in-context information for enhancing the low-resource understanding quality of LLMs.
arXiv Detail & Related papers (2024-03-25T07:55:29Z)
- Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing.
We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z)
- Reasoning with Language Model Prompting: A Survey [86.96133788869092]
Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications.
This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting.
arXiv Detail & Related papers (2022-12-19T16:32:42Z)
- Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview [43.92114369646489]
We have released 38 datasets for building text-to-speech and automatic speech recognition applications.
The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
arXiv Detail & Related papers (2020-10-14T02:24:04Z)
- Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate an approach on data augmentation, better suited to low resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.