Related papers: Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository

URL: http://arxiv.org/abs/2410.11291v2
Date: Wed, 16 Oct 2024 06:25:57 GMT
Title: Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository
Authors: S. Tamang, D. J. Bora
Abstract summary: This paper introduces a centralized, open-source dataset repository designed to advance NLP and NMT for Assamese, a low-resource language. The repository, available at GitHub, supports various tasks like sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces a centralized, open-source dataset repository designed to advance NLP and NMT for Assamese, a low-resource language. The repository, available at GitHub, supports various tasks like sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora. We review existing datasets, highlighting the need for standardized resources in Assamese NLP, and discuss potential applications in AI-driven research, such as LLMs, OCR, and chatbots. While promising, challenges like data scarcity and linguistic diversity remain. The repository aims to foster collaboration and innovation, promoting Assamese language research in the digital age.

Related papers

SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods [1.2091341579150698]
We release datasets of sentences containing polysemous words across ten low-resource languages. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation.
arXiv Detail & Related papers (2025-05-29T17:48:08Z)
Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages? [2.7624021966289605]
Named Entity Recognition for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available. Data augmentation for increasing the amount of low-resource labeled data is a common practice. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.
arXiv Detail & Related papers (2025-05-22T15:50:47Z)
MAGE: Multi-Head Attention Guided Embeddings for Low Resource Sentiment Classification [0.19381162067627603]
We introduce an advanced model combining Language-Independent Data Augmentation (LiDA) with Multi-Head Attention based weighted embeddings. This approach not only addresses the data scarcity issue but also sets a foundation for future research in low-resource language processing and classification tasks.
arXiv Detail & Related papers (2025-02-25T08:53:27Z)
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages. We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek [2.3499129784547663]
First, we evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models on seven core NLP tasks with dataset availability. Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training. Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering long legal texts.
arXiv Detail & Related papers (2025-01-22T12:06:16Z)
SwaQuAD-24: QA Benchmark Dataset in Swahili [0.0]
This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset. The dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili. Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset development.
arXiv Detail & Related papers (2024-10-18T08:49:24Z)
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub. We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z)
Language Agnostic Data-Driven Inverse Text Normalization [6.43601166279978]
The inverse text normalization (ITN) problem attracts the attention of researchers from various fields. Due to the scarcity of labeled spoken-written datasets, studies on non-English data-driven ITN are quite limited. We propose a language-agnostic data-driven ITN framework to fill this gap.
arXiv Detail & Related papers (2023-01-20T10:33:03Z)
NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets. We survey language-proficient NLP researchers and crowd workers per language. We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources. Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages. We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
FedNLP: A Research Platform for Federated Learning in Natural Language Processing [55.01246123092445]
We present FedNLP, a research platform for federated learning in NLP. FedNLP supports various popular task formulations in NLP such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling. Preliminary experiments with FedNLP reveal that there exists a large performance gap between learning on decentralized and centralized datasets.
arXiv Detail & Related papers (2021-04-18T11:04:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.