Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository
- URL: http://arxiv.org/abs/2410.11291v2
- Date: Wed, 16 Oct 2024 06:25:57 GMT
- Title: Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository
- Authors: S. Tamang, D. J. Bora
- Abstract summary: This paper introduces a centralized, open-source dataset repository designed to advance NLP and NMT for Assamese, a low-resource language.
The repository, available at GitHub, supports various tasks like sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora.
- Score: 0.0
- Abstract: This paper introduces a centralized, open-source dataset repository designed to advance NLP and NMT for Assamese, a low-resource language. The repository, available at GitHub, supports various tasks like sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora. We review existing datasets, highlighting the need for standardized resources in Assamese NLP, and discuss potential applications in AI-driven research, such as LLMs, OCR, and chatbots. While promising, challenges like data scarcity and linguistic diversity remain. The repository aims to foster collaboration and innovation, promoting Assamese language research in the digital age.
Related papers
- WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
- Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek [2.3499129784547663]
We evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models on seven core NLP tasks for which datasets are available.
Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training.
Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering long legal texts.
arXiv Detail & Related papers (2025-01-22T12:06:16Z)
- SwaQuAD-24: QA Benchmark Dataset in Swahili [0.0]
This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset.
The dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili.
Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset development.
arXiv Detail & Related papers (2024-10-18T08:49:24Z)
- EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages.
We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia.
We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z)
- GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z)
- Language Agnostic Data-Driven Inverse Text Normalization [6.43601166279978]
The inverse text normalization (ITN) problem attracts the attention of researchers from various fields.
Due to the scarcity of labeled spoken-written datasets, the studies on non-English data-driven ITN are quite limited.
We propose a language-agnostic data-driven ITN framework to fill this gap.
arXiv Detail & Related papers (2023-01-20T10:33:03Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from a rich-resource language to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- FedNLP: A Research Platform for Federated Learning in Natural Language Processing [55.01246123092445]
We present FedNLP, a research platform for federated learning in NLP.
FedNLP supports various popular task formulations in NLP such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling.
Preliminary experiments with FedNLP reveal that there exists a large performance gap between learning on decentralized and centralized datasets.
arXiv Detail & Related papers (2021-04-18T11:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.