BanglaWriting: A multi-purpose offline Bangla handwriting dataset
- URL: http://arxiv.org/abs/2011.07499v3
- Date: Fri, 19 Aug 2022 14:06:08 GMT
- Title: BanglaWriting: A multi-purpose offline Bangla handwriting dataset
- Authors: M. F. Mridha, Abu Quwsar Ohi, M. Ameer Ali, Mazedul Islam Emon,
Muhammad Mohsin Kabir
- Abstract summary: This article presents a Bangla handwriting dataset that contains single-page handwritings of 260 individuals of different personalities and ages.
This dataset contains 21,234 words and 32,787 characters in total, including 5,470 unique words of Bangla vocabulary.
The dataset can be used for complex optical character/word recognition, writer identification, handwritten word segmentation, and word generation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article presents a Bangla handwriting dataset named BanglaWriting that
contains single-page handwritings of 260 individuals of different personalities
and ages. Each page includes bounding boxes that bound each word, along with
the Unicode representation of the writing. This dataset contains 21,234 words
and 32,787 characters in total. Moreover, this dataset includes 5,470 unique
words of Bangla vocabulary. Apart from the usual words, the dataset comprises
261 comprehensible overwritings and 450 handwritten strikes and mistakes. All of
the bounding boxes and word labels are manually generated. The dataset can be
used for complex optical character/word recognition, writer identification,
handwritten word segmentation, and word generation. Furthermore, this dataset
is suitable for extracting age-based and gender-based variation of handwriting.
Related papers
- Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition [5.28595286827031]
The Manuscripts of Handwritten Arabic (Muharaf) dataset is a machine learning dataset consisting of more than 1,600 historic handwritten page images.
This dataset was compiled to advance the state of the art in handwritten text recognition.
arXiv Detail & Related papers (2024-06-13T23:40:34Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- BN-DRISHTI: Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images [0.0]
This paper introduces a deep learning-based object detection framework (YOLO) combined with Hough and affine transformations for skew correction.
We present an extended version of the BN-HTRd dataset comprising 786 full-page handwritten Bangla document images.
Evaluation on the test portion of our dataset resulted in an F-score of 99.97% for line and 98% for word segmentation.
arXiv Detail & Related papers (2023-05-31T04:08:57Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR) and Line Segmentation [0.0]
We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations.
The BN-HTRd dataset is based on the BBC Bangla News corpus, meant to act as ground truth texts.
Our dataset includes 788 images of handwritten pages produced by approximately 150 different writers.
arXiv Detail & Related papers (2022-05-29T22:56:26Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets of other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- MarkBERT: Marking Word Boundaries Improves Chinese BERT [67.53732128091747]
MarkBERT keeps Chinese characters as its vocabulary and inserts boundary markers between contiguous words.
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
arXiv Detail & Related papers (2022-03-12T08:43:06Z)
- Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues.
A main challenge is that a person often writes a letter in different styles from time to time.
We refer to this problem as the variance of online writing styles (Var-O-Styles).
arXiv Detail & Related papers (2021-12-06T07:21:53Z)
- A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes [1.009810782568186]
We propose a labeling scheme that makes segmentation inside alpha-syllabary words linear.
The dataset contains 411k curated samples of 1295 unique commonly used Bengali graphemes.
The dataset is open-sourced as a part of a public Handwritten Grapheme Classification Challenge on Kaggle.
arXiv Detail & Related papers (2020-10-01T01:51:45Z)
- MatriVasha: A Multipurpose Comprehensive Database for Bangla Handwritten Compound Characters [0.0]
MatriVasha is a project that can recognize several handwritten Bangla compound characters.
The proposed dataset is so far the most extensive dataset for Bangla compound characters.
arXiv Detail & Related papers (2020-04-29T06:38:12Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.