ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition
- URL: http://arxiv.org/abs/2502.11198v1
- Date: Sun, 16 Feb 2025 16:59:10 GMT
- Title: ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition
- Authors: Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor
- Abstract summary: The dataset has 10,443 sentences, 3,481 per region.
The data was collected from two publicly available datasets and through web scraping of online newspapers and articles.
The dataset is structured into separate subsets for each region and is available in CSV format.
- Score: 0.8025340896297104
- Abstract: ANCHOLIK-NER is a linguistically diverse dataset for Named Entity Recognition (NER) in Bangla regional dialects, capturing variations across Sylhet, Chittagong, and Barishal. The dataset has 10,443 sentences, 3,481 per region. The data was collected from two publicly available datasets and through web scraping of online newspapers and articles. To ensure high-quality annotations, the BIO tagging scheme was employed, and professional annotators with expertise in regional dialects carried out the labeling. The dataset is structured into separate subsets for each region and is available in CSV format. Each entry contains textual data along with the identified named entities and their corresponding annotations. Named entities are categorized into ten classes: Person, Location, Organization, Food, Animal, Colour, Role, Relation, Object, and Miscellaneous. The dataset is a resource for developing and evaluating NER models for Bangla dialectal variations, contributing to regional language processing and low-resource NLP applications. It can be used to enhance NER systems for Bangla dialects, improve regional language understanding, and support applications in machine translation, information retrieval, and conversational AI.
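As a rough illustration of the BIO scheme mentioned in the abstract, the sketch below builds a label set from the ten listed classes and groups BIO-tagged tokens into entity spans. The tag spellings (for example `B-Person`), the helper name `bio_to_spans`, and the English example tokens are assumptions made for illustration; they are not taken from the dataset files.

```python
from typing import List, Optional, Tuple

# The ten entity classes named in the abstract.
CLASSES = ["Person", "Location", "Organization", "Food", "Animal",
           "Colour", "Role", "Relation", "Object", "Miscellaneous"]

# Assumed tag spellings: "O" plus a B- and an I- tag per class.
LABELS = ["O"] + [f"{prefix}-{c}" for c in CLASSES for prefix in ("B", "I")]


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity class) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if there is one.
        if current and current_type is not None:
            spans.append((" ".join(current), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A B- tag starts a new entity; a stray I- tag is treated the same way.
            flush()
            current, current_type = [token], etype
        elif prefix == "I":
            # Continuation of the entity currently being built.
            current.append(token)
        else:
            # "O" ends any open entity.
            flush()
            current, current_type = [], None
    flush()
    return spans


# Hypothetical English example; the real data is in Bangla regional dialects.
example_tokens = ["Rahim", "went", "to", "Cox's", "Bazar"]
example_tags = ["B-Person", "O", "O", "B-Location", "I-Location"]
example_spans = bio_to_spans(example_tokens, example_tags)
```

With ten classes this scheme yields 21 labels (one `O` tag plus a `B-` and an `I-` tag per class), which is the output size a token classifier trained on such data would need under these assumptions.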
Related papers
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition [10.244515100904144]
In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset.
We developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios.
We benchmarked the trained ASR with publicly available datasets and compared it with other available models.
Our results demonstrate the efficacy of the model trained on pseudo-label data for the designed test set along with publicly available Bangla datasets.
arXiv Detail & Related papers (2023-11-06T15:37:14Z)
- BanglaCoNER: Towards Robust Bangla Complex Named Entity Recognition [0.0]
We present the winning solution of the Bangla Complex Named Entity Recognition Challenge.
The dataset consisted of 15,300 sentences for training and 800 sentences for validation, in the .conll format.
Our findings also demonstrate the efficacy of deep learning models such as BanglaBERT for NER in the Bangla language.
arXiv Detail & Related papers (2023-03-16T13:31:31Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- AsNER -- Annotated Dataset and Baseline for Assamese Named Entity recognition [7.252817150901275]
The proposed NER dataset is likely to be a significant resource for deep neural based Assamese language processing.
We benchmark the dataset by training NER models and evaluating using state-of-the-art architectures for supervised named entity recognition.
The best baseline achieves an F1-score of 80.69% when using MuRIL as a word embedding method.
arXiv Detail & Related papers (2022-07-07T16:45:55Z)
- HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
arXiv Detail & Related papers (2022-04-28T19:14:21Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- Dataset Geography: Mapping Language Data to Language Users [17.30955185832338]
We study the geographical representativeness of NLP datasets, aiming to quantify whether and by how much NLP datasets match the expected needs of the language speakers.
In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency.
Last, we explore some geographical and economic factors that may explain the observed dataset distributions.
arXiv Detail & Related papers (2021-12-07T05:13:50Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- PhraseCut: Language-based Image Segmentation in the Wild [62.643450401286]
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state-of-the-art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.