BNLP: Natural language processing toolkit for Bengali language
- URL: http://arxiv.org/abs/2102.00405v1
- Date: Sun, 31 Jan 2021 07:56:08 GMT
- Title: BNLP: Natural language processing toolkit for Bengali language
- Authors: Sagor Sarker
- Abstract summary: BNLP is an open-source language processing toolkit for the Bengali language.
It provides tokenization, word embedding, POS tagging, and NER tagging facilities.
BNLP is widely used in the Bengali research community, with 16K downloads, 119 stars, and 31 forks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BNLP is an open-source language processing toolkit for the Bengali
language that provides tokenization, word embedding, POS tagging, and NER
tagging facilities. BNLP ships pre-trained models that achieve high accuracy
on model-based tokenization, embedding, POS tagging, and NER tagging tasks
for Bengali. BNLP is widely used in the Bengali research community, with 16K
downloads, 119 stars, and 31 forks. BNLP is available at
https://github.com/sagorbrur/bnlp.
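As an illustration of the tokenization facility, the sketch below reimplements a rule-based basic tokenizer in plain Python: whitespace splitting plus separation of punctuation, including the Bengali danda "।". This is a minimal sketch for intuition only, not BNLP's actual implementation; the toolkit's real API (and its model-based tokenizers) lives in the GitHub repository above.

```python
import re

# Punctuation peeled off word boundaries; the Bengali danda "।"
# ends sentences much like a period in English.
_PUNCT = r"।,;:!?\"'()\[\]{}"

def basic_tokenize(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(rf"[^\s{_PUNCT}]+|[{_PUNCT}]", text)

tokens = basic_tokenize("আমি ভাত খাই।")  # "I eat rice."
print(tokens)  # → ['আমি', 'ভাত', 'খাই', '।']
```

A model-based tokenizer (e.g. one trained with SentencePiece, which BNLP also supports) would instead learn subword boundaries from a corpus rather than rely on fixed punctuation rules.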
Related papers
- Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE [0.0]
BengaliBPE is a language-aware subword tokenizer for the Bengali script. It applies Unicode normalization and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. It provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost.
arXiv Detail & Related papers (2025-11-07T15:23:32Z)
- Multichannel Attention Networks with Ensembled Transfer Learning to Recognize Bangla Handwritten Charecter [1.5236380958983642]
The study employed a convolutional neural network (CNN) with ensemble transfer learning and a multichannel attention network.
We evaluated the proposed model on the CAMTERdb 3.1.2 dataset and achieved 92% accuracy for the raw dataset and 98% for the preprocessed dataset.
arXiv Detail & Related papers (2024-08-20T15:51:01Z)
- Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs [2.309018557701645]
We explore whether there is a need for Large Language Models dedicated to a low-resource language, beyond the existing English-oriented ones.
We compare the performance of open-weight and closed-source LLMs against fine-tuned encoder-decoder models.
Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent.
arXiv Detail & Related papers (2024-06-29T11:50:16Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on language varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- mahaNLP: A Marathi Natural Language Processing Library [0.4499833362998489]
We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language.
It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP.
arXiv Detail & Related papers (2023-11-05T06:59:59Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- naab: A ready-to-use plug-and-play corpus for Farsi [1.381198851698147]
naab is the largest publicly available, cleaned, and ready-to-use Farsi textual corpus.
naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words.
naab-raw, an unprocessed version of the dataset, is also released along with a pre-processing toolkit.
arXiv Detail & Related papers (2022-08-29T10:40:58Z)
- Number Entity Recognition [65.80137628972312]
Numbers are essential components of text, like any other word tokens, from which natural language processing (NLP) models are built and deployed.
In this work, we attempt to tap this potential of state-of-the-art NLP models and transfer their ability to boost performance in related tasks.
Our proposed classification of numbers into entities helps NLP models perform well on several tasks, including a handcrafted Fill-In-The-Blank (FITB) task and on question answering using joint embeddings.
arXiv Detail & Related papers (2022-05-07T05:22:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Bengali Handwritten Grapheme Classification: Deep Learning Approach [0.0]
We participate in a Kaggle competition where the challenge is to classify three constituent elements of a Bengali grapheme in the image.
We explore the performance of existing neural network models such as the Multi-Layer Perceptron (MLP) and the state-of-the-art ResNet50.
We propose our own convolutional neural network (CNN) model for Bengali grapheme classification, with validation accuracies of 95.32% for the grapheme root, 98.61% for the vowel, and 98.76% for the consonant.
arXiv Detail & Related papers (2021-11-16T06:14:59Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation [6.2418269277908065]
Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in the machine translation literature due to its scarcity of resources.
We build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups.
With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs.
arXiv Detail & Related papers (2020-09-20T06:06:27Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.