Vacaspati: A Diverse Corpus of Bangla Literature
- URL: http://arxiv.org/abs/2307.05083v1
- Date: Tue, 11 Jul 2023 07:32:12 GMT
- Title: Vacaspati: A Diverse Corpus of Bangla Literature
- Authors: Pramit Bhattacharyya, Joydeep Mondal, Subhadip Maji, Arnab
Bhattacharya
- Abstract summary: We build Vacaspati, a diverse corpus of Bangla literature.
It contains more than 11 million sentences and 115 million words.
We also built a word embedding model, Vac-FT, by training FastText on Vacaspati, and trained an Electra model, Vac-BERT, on the corpus.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bangla (or Bengali) is the fifth most spoken language globally; yet,
state-of-the-art NLP in Bangla lags behind even for simple tasks such as
lemmatization, POS tagging, etc. This is partly due to the lack of a varied,
high-quality corpus. To address this need, we build Vacaspati, a diverse corpus
of Bangla literature. The literary works are collected from various websites;
only those works that are publicly available without copyright violations or
restrictions are included. We believe that published literature captures the
features of a language much better than newspapers, blogs or social media
posts, which tend to follow only a certain literary pattern and, therefore,
miss out on language variety. Our corpus Vacaspati is varied across multiple
aspects, including type of composition, topic, author, time and space. It
contains more than 11 million sentences and 115 million words. We also built a
word embedding model, Vac-FT, by training FastText on Vacaspati, and trained an
Electra model, Vac-BERT, on the corpus. Vac-BERT has far fewer parameters and
requires only a fraction of the resources of other state-of-the-art transformer
models, yet performs better or comparably on various downstream tasks. On
multiple downstream tasks, Vac-FT outperforms other FastText-based models. We
also demonstrate the efficacy of Vacaspati as a corpus by showing that similar
models built from other corpora are not as effective. The models are available
at https://bangla.iitk.ac.in/.
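As a rough illustration of how an embedding model like Vac-FT could be built, here is a minimal sketch using the fasttext Python library; the corpus path and every hyperparameter below are assumptions for illustration, not the paper's reported settings.

```python
# Minimal sketch: training FastText embeddings on a plain-text corpus,
# roughly how a model like Vac-FT could be produced. All settings here
# are illustrative assumptions, not the paper's configuration.
import fasttext

model = fasttext.train_unsupervised(
    "vacaspati.txt",   # hypothetical path: corpus with one sentence per line
    model="skipgram",  # subword-aware skip-gram objective
    dim=300,           # embedding dimensionality
    minCount=5,        # drop words rarer than this
)

# Subword n-grams let FastText embed even unseen inflected forms, which
# matters for a morphologically rich language like Bangla.
print(model.get_nearest_neighbors("ভাষা"))  # "language"
```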
Related papers
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 × 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
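STAR's actual architecture and training are not reproduced here; the following is a generic sketch of support-based author attribution, assuming each document has already been mapped to a style embedding (all names and sizes are illustrative).

```python
# Generic sketch: attribute a query document to the author whose averaged
# support-set embedding ("prototype") is most cosine-similar. This is an
# illustration of the evaluation setup, not STAR's implementation.
import numpy as np

def build_prototypes(support: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Average each author's support-document embeddings into one prototype."""
    return {author: embs.mean(axis=0) for author, embs in support.items()}

def attribute(query: np.ndarray, prototypes: dict[str, np.ndarray]) -> str:
    """Return the author whose prototype is most cosine-similar to the query."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(prototypes, key=lambda author: cos(query, prototypes[author]))

# Toy usage: 3 authors, 8 support documents each, 256-dim embeddings.
rng = np.random.default_rng(0)
support = {f"author_{i}": rng.normal(size=(8, 256)) for i in range(3)}
query = support["author_1"].mean(axis=0) + rng.normal(scale=0.1, size=256)
print(attribute(query, build_prototypes(support)))  # -> author_1
```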
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning markedly improves model performance in this low-resource language scenario.
We obtain a micro-F1 of 67.02% on the test set, and our submission ranks 21st on the shared task leaderboard.
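For reference, micro-F1 (the task's reported metric) pools true/false positives and negatives across all classes before computing F1. A minimal scikit-learn sketch with made-up labels:

```python
# Micro-F1 aggregates TP/FP/FN over all classes before computing F1.
# The labels below are made up purely to show the computation.
from sklearn.metrics import f1_score

y_true = [0, 2, 1, 2, 0, 1, 2, 2]  # hypothetical gold sentiment labels
y_pred = [0, 2, 1, 1, 0, 1, 2, 0]  # hypothetical model predictions

print(f"micro-F1: {f1_score(y_true, y_pred, average='micro'):.4f}")
```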
arXiv Detail & Related papers (2023-10-13T16:46:38Z)
- Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach [0.866842899233181]
Plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India.
We collected Bengali literature books from the National Digital Library of India, extracted their texts with a comprehensive methodology, and constructed our corpus from them.
Our experimental results show an average accuracy between 72.10% and 79.89% for text extraction using OCR.
We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts.
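The paper's exact similarity pipeline is not given here; below is a minimal sketch of a TF-IDF cosine-similarity check in the same spirit, where the character n-gram features are an assumption chosen for robustness to OCR noise.

```python
# Sketch: flag a suspect document by its TF-IDF cosine similarity to a
# reference corpus. Character n-grams are an assumption here, chosen
# because they tolerate OCR noise; the paper's features may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

references = ["reference text one ...", "reference text two ..."]  # OCR output
suspect = ["document to check ..."]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
ref_matrix = vectorizer.fit_transform(references)
suspect_vec = vectorizer.transform(suspect)

# High similarity to any reference suggests potential plagiarism.
print(cosine_similarity(suspect_vec, ref_matrix))
```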
arXiv Detail & Related papers (2022-03-25T03:11:00Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units such as syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
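A generic PyTorch sketch of an LSTM language model over discrete sub-word units (e.g., phoneme or syllable IDs); the layer sizes and the random toy batch are illustrative assumptions, not the authors' configuration.

```python
# Generic sketch: next-unit prediction with an LSTM over discrete
# sub-word unit IDs (phonemes/syllables). Sizes are illustrative.
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, n_units: int, emb: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_units)  # logits over next unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embed(x))
        return self.head(out)

# Toy usage: the training signal is the sequence shifted by one position.
model = UnitLM(n_units=50)
seq = torch.randint(0, 50, (1, 20))  # (batch, time) of random unit IDs
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 50), seq[:, 1:].reshape(-1))
print(float(loss))
```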
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Neural Models for Offensive Language Detection [0.0]
Offensive language detection is an ever-growing natural language processing (NLP) application.
We believe that improving and comparing different machine learning models to fight such harmful content is an important and challenging goal of this thesis.
arXiv Detail & Related papers (2021-05-30T13:02:45Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB of data crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
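One way to see this embedding barrier is to compare how heavily tokenizers fragment Bangla text. The sketch below assumes the public bert-base-multilingual-cased and csebuetnlp/banglabert checkpoints; it is our own probe, not an experiment from the paper.

```python
# Probe: a tokenizer whose vocabulary under-represents the Bangla script
# fragments text into many small pieces, wasting sequence length. The
# checkpoint names are assumed public Hugging Face models.
from transformers import AutoTokenizer

sentence = "আমি বাংলায় গান গাই"  # "I sing in Bangla"

for name in ["bert-base-multilingual-cased", "csebuetnlp/banglabert"]:
    tokens = AutoTokenizer.from_pretrained(name).tokenize(sentence)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```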
arXiv Detail & Related papers (2021-01-01T09:28:45Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It demonstrates state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing as well as newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
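For context, each COPA-style instance pairs a premise with two alternatives and asks which is the more plausible cause or effect. The sketch below shows that format with a trivial word-overlap scorer standing in for a real model; the field names follow the common COPA convention and are an assumption here.

```python
# COPA-style instance: pick the more plausible cause/effect alternative.
# Field names follow the common COPA convention (an assumption), and
# overlap_score is a toy stand-in; real systems use pretrained LMs.
example = {
    "premise": "The man broke his toe.",
    "question": "cause",                       # asking for the cause
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
    "label": 1,                                # 0-indexed plausible choice
}

def overlap_score(premise: str, choice: str) -> int:
    """Toy heuristic: count shared lowercase tokens."""
    return len(set(premise.lower().split()) & set(choice.lower().split()))

pred = max((0, 1), key=lambda i: overlap_score(example["premise"], example[f"choice{i + 1}"]))
print("predicted:", pred, "gold:", example["label"])
```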
arXiv Detail & Related papers (2020-05-01T12:22:33Z)