A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes
- URL: http://arxiv.org/abs/2010.00170v3
- Date: Wed, 13 Jan 2021 17:19:52 GMT
- Title: A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes
- Authors: Samiul Alam, Tahsin Reasat, Asif Shahriyar Sushmit, Sadi Mohammad
Siddiquee, Fuad Rahman, Mahady Hasan, Ahmed Imtiaz Humayun
- Abstract summary: We propose a labeling scheme that makes segmentation inside alpha-syllabary words linear.
The dataset contains 411k curated samples of 1295 unique commonly used Bengali graphemes.
The dataset is open-sourced as a part of a public Handwritten Grapheme Classification Challenge on Kaggle.
- Score: 1.009810782568186
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Latin has historically led the state-of-the-art in handwritten optical
character recognition (OCR) research. Adapting existing systems from Latin to
alpha-syllabary languages is particularly challenging due to a sharp contrast
between their orthographies. The segmentation of graphical constituents
corresponding to characters becomes significantly hard due to a cursive writing
system and frequent use of diacritics in the alpha-syllabary family of
languages. We propose a labeling scheme based on graphemes (linguistic segments
of word formation) that makes segmentation inside alpha-syllabary words linear
and present the first dataset of Bengali handwritten graphemes that are
commonly used in an everyday context. The dataset contains 411k curated samples
of 1295 unique commonly used Bengali graphemes. Additionally, the test set
contains 900 uncommon Bengali graphemes for out-of-dictionary performance
evaluation. The dataset is open-sourced as a part of a public Handwritten
Grapheme Classification Challenge on Kaggle to benchmark vision algorithms for
multi-target grapheme classification. The unique graphemes present in this
dataset are selected based on commonality in the Google Bengali ASR corpus.
From competition proceedings, we see that deep-learning methods can generalize
to a large span of out-of-dictionary graphemes which are absent during
training. Dataset and starter codes at www.kaggle.com/c/bengaliai-cv19.
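The multi-target setup described in the abstract can be sketched in a few lines. The sketch below assumes each grapheme label splits into three targets (root, vowel diacritic, consonant diacritic), as the related-papers entry on Bengali grapheme classification suggests. The names `GraphemeLabel`, `per_target_accuracy`, `is_out_of_dictionary`, and `components_seen` are hypothetical, and the integer class indices are illustrative rather than taken from the dataset.

```python
from typing import Dict, NamedTuple, Sequence, Set


class GraphemeLabel(NamedTuple):
    """Hypothetical three-part label; class indices are illustrative only."""
    root: int
    vowel: int
    consonant: int


def per_target_accuracy(
    truth: Sequence[GraphemeLabel], pred: Sequence[GraphemeLabel]
) -> Dict[str, float]:
    """Return the fraction of correct predictions for each target separately."""
    if not truth or len(truth) != len(pred):
        raise ValueError("truth and pred must be non-empty and of equal length")
    n = len(truth)
    return {
        field: sum(getattr(t, field) == getattr(p, field)
                   for t, p in zip(truth, pred)) / n
        for field in GraphemeLabel._fields
    }


def is_out_of_dictionary(label: GraphemeLabel,
                         train_labels: Set[GraphemeLabel]) -> bool:
    """A grapheme is out of dictionary if its exact combination was never trained on."""
    return label not in train_labels


def components_seen(label: GraphemeLabel,
                    train_labels: Set[GraphemeLabel]) -> bool:
    """Check that every component of the label occurs in some training label."""
    return all(
        any(getattr(t, field) == getattr(label, field) for t in train_labels)
        for field in GraphemeLabel._fields
    )
```

Under this assumption, a test grapheme can be out of dictionary as a whole while each of its components still appears in training, which is one plausible reading of why per-target classifiers can generalize to unseen graphemes.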
Related papers
- Bukva: Russian Sign Language Alphabet [75.42794328290088]
This paper investigates the recognition of the Russian fingerspelling alphabet, also known as the Russian Sign Language (RSL) dactyl.
Dactyl is a component of sign languages where distinct hand movements represent individual letters of a written language.
We provide Bukva, the first full-fledged open-source video dataset for RSL dactyl recognition.
arXiv Detail & Related papers (2024-10-11T09:59:48Z)
- Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets [51.74296438621836]
We introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels.
The main limitation of scribbles as a source for weak supervision is the lack of challenging datasets for scribble segmentation.
Scribbles for All provides scribble labels for several popular segmentation datasets and provides an algorithm to automatically generate scribble labels for any dataset with dense annotations.
arXiv Detail & Related papers (2024-08-22T15:29:08Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings [20.25180279903009]
We propose Contrastive Graph-Text pretraining (ConGraT) for jointly learning separate representations of texts and nodes in a text-attributed graph (TAG).
Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP.
Experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling.
arXiv Detail & Related papers (2023-05-23T17:53:30Z)
- Unicode Normalization and Grapheme Parsing of Indic Languages [2.974799610163104]
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units.
Our proposed normalizer is a more efficient and effective tool than the previously used Indic normalizer.
We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
arXiv Detail & Related papers (2023-05-11T14:34:08Z)
- A Benchmark and Dataset for Post-OCR text correction in Sanskrit [23.45279030301887]
Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation.
We release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books.
arXiv Detail & Related papers (2022-11-15T08:32:18Z)
- Improving Graph-Based Text Representations with Character and Word Level N-grams [30.699644290131044]
We propose a new word-character text graph that combines word and character n-gram nodes together with document nodes.
We also propose two new graph-based neural models, WCTextGCN and WCTextGAT, for modeling our proposed text graph.
arXiv Detail & Related papers (2022-10-12T08:07:54Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- Bengali Handwritten Grapheme Classification: Deep Learning Approach [0.0]
We participate in a Kaggle competition where the challenge is to classify three constituent elements of a Bengali grapheme in the image.
We explore the performances of some existing neural network models such as Multi-Layer Perceptron (MLP) and the state-of-the-art ResNet50.
We propose our own convolutional neural network (CNN) model for Bengali grapheme classification with validation root accuracy 95.32%, vowel accuracy 98.61%, and consonant accuracy 98.76%.
arXiv Detail & Related papers (2021-11-16T06:14:59Z)
- Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, which is based on graph neural network (GNN) for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study carries out an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.