BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset
- URL: http://arxiv.org/abs/2210.05109v1
- Date: Tue, 11 Oct 2022 02:52:31 GMT
- Title: BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset
- Authors: Ajwad Akil, Najrin Sultana, Abhik Bhattacharjee and Rifat Shahriyar
- Abstract summary: We present BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline.
We aim to take a step towards alleviating the low resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase.
- Score: 3.922582192616519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present BanglaParaphrase, a high-quality synthetic Bangla
Paraphrase dataset curated by a novel filtering pipeline. We aim to take a step
towards alleviating the low resource status of the Bangla language in the NLP
domain through the introduction of BanglaParaphrase, which ensures quality by
preserving both semantics and diversity, making it particularly useful to
enhance other Bangla datasets. We show a detailed comparative analysis between
our dataset and models trained on it with other existing works to establish the
viability of our synthetic paraphrase data generation pipeline. We are making
the dataset and models publicly available at
https://github.com/csebuetnlp/banglaparaphrase to further the state of Bangla
NLP.
Related papers
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR
Back-Translation [59.91139600152296]
ParaAMR is a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation.
We show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning.
arXiv Detail & Related papers (2023-05-26T02:27:33Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language and popular NLP models fail to perform well.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - Improving Keyphrase Extraction with Data Augmentation and Information
Filtering [67.43025048639333]
Keyphrase extraction is one of the essential tasks for document understanding in NLP.
We present a novel corpus and method for keyphrase extraction from the videos streamed on the Behance platform.
arXiv Detail & Related papers (2022-09-11T22:38:02Z) - BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset BAN-Cap following the widely used Flickr8k dataset, where we collect Bangla captions of the images provided by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
arXiv Detail & Related papers (2022-05-28T15:39:09Z) - BanglaNLG: Benchmarks and Resources for Evaluating Low-Resource Natural
Language Generation in Bangla [21.47743471497797]
This work presents a benchmark for evaluating natural language generation models in Bangla.
We aggregate three challenging conditional text generation tasks under the BanglaNLG benchmark.
Using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer model for Bangla.
BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming mT5 (base) by up to 5.4%.
arXiv Detail & Related papers (2022-05-23T06:54:56Z) - Retrieval-Augmented Multilingual Keyphrase Generation with
Retriever-Generator Iterative Training [66.64843711515341]
Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text.
We call attention to a new setting named multilingual keyphrase generation.
We propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages.
arXiv Detail & Related papers (2022-05-21T00:45:21Z) - WANLI: Worker and AI Collaboration for Natural Language Inference
Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z) - End-to-End Natural Language Understanding Pipeline for Bangla
Conversational Agents [0.43012765978447565]
We propose a novel approach to build a business assistant which can communicate in Bangla and Bangla Transliteration in English with high confidence consistently.
We use Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks.
We present a pipeline for intent classification and entity extraction which achieves reasonable performance.
arXiv Detail & Related papers (2021-07-12T16:09:22Z) - A Review of Bangla Natural Language Processing Tasks and the Utility of
Transformer Models [2.5768647103950357]
We provide a review of Bangla NLP tasks, resources, and tools available to the research community.
We benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms.
We report our results using both individual and consolidated datasets and provide data for future research.
arXiv Detail & Related papers (2021-07-08T13:49:46Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.