End-to-End Natural Language Understanding Pipeline for Bangla
Conversational Agents
- URL: http://arxiv.org/abs/2107.05541v2
- Date: Tue, 13 Jul 2021 01:52:58 GMT
- Title: End-to-End Natural Language Understanding Pipeline for Bangla
Conversational Agents
- Authors: Fahim Shahriar Khan, Mueeze Al Mushabbir, Mohammad Sabik Irbaz, MD
Abdullah Al Nasim
- Abstract summary: We propose a novel approach to building a business assistant that can consistently communicate, with high confidence, in both Bangla and Bangla transliterated into English.
We use Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks.
We present a pipeline for intent classification and entity extraction which achieves reasonable performance.
- Score: 0.43012765978447565
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Chatbots are intelligent software built to be used as a replacement for human
interaction. However, existing studies typically do not provide enough support
for low-resource languages like Bangla. Moreover, due to the increasing
popularity of social media, we can also see the rise of interactions in Bangla
transliteration (mostly in English) among the native Bangla speakers. In this
paper, we propose a novel approach to building a Bangla chatbot intended as a
business assistant that can consistently communicate, with high confidence, in
both Bangla and Bangla transliterated into English. Since annotated
data was not available for this purpose, we had to work on the whole machine
learning life cycle (data preparation, machine learning modeling, and model
deployment) using Rasa Open Source Framework, fastText embeddings, Polyglot
embeddings, Flask, and other systems as building blocks. While working with the
skewed annotated dataset, we evaluated different setups and pipelines to
determine which works best and offer possible explanations for the observed results.
Finally, we present a pipeline for intent classification and entity extraction
which achieves reasonable performance (accuracy: 83.02%, precision: 80.82%,
recall: 83.02%, F1-score: 80%).
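The intent-classification stage described above (averaged word embeddings feeding a classifier) can be illustrated with a minimal sketch. This is not the authors' Rasa pipeline: a hash-derived toy embedder stands in for trained fastText vectors, and a nearest-centroid rule stands in for Rasa's classifier; all names (`embed_word`, `NearestCentroidIntentClassifier`, the example intents) are hypothetical.

```python
import hashlib
import math
from collections import defaultdict

DIM = 32

def embed_word(word, dim=DIM):
    # Toy stand-in for fastText: derive a fixed pseudo-random vector from a
    # hash of the word. Real fastText vectors are learned from subword n-grams.
    h = hashlib.sha256(word.encode("utf-8")).digest()
    return [(h[i % len(h)] - 128) / 128.0 for i in range(dim)]

def embed_sentence(text):
    # Average the word vectors, as embedding-based featurizers commonly do.
    vecs = [embed_word(w) for w in text.lower().split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class NearestCentroidIntentClassifier:
    """Assign an utterance to the intent whose centroid is nearest in cosine distance."""

    def fit(self, examples):  # examples: list of (text, intent) pairs
        buckets = defaultdict(list)
        for text, intent in examples:
            buckets[intent].append(embed_sentence(text))
        self.centroids = {
            intent: [sum(col) / len(col) for col in zip(*vecs)]
            for intent, vecs in buckets.items()
        }
        return self

    def predict(self, text):
        v = embed_sentence(text)
        return max(self.centroids, key=lambda i: cosine(v, self.centroids[i]))

clf = NearestCentroidIntentClassifier().fit([
    ("what is the price", "ask_price"),
    ("how much does it cost", "ask_price"),
    ("hello there", "greet"),
    ("hi good morning", "greet"),
])
print(clf.predict("what is the price"))
```

In the paper's actual setup the embedding and classification components are swapped in via the Rasa pipeline configuration rather than hand-coded as here.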
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Enhancing Bangla Language Next Word Prediction and Sentence Completion through Extended RNN with Bi-LSTM Model On N-gram Language [1.3693860189056777]
This paper introduces a Bi-LSTM model that effectively handles Bangla next-word prediction and Bangla sentence generation.
We constructed a corpus dataset from various news portals, including bdnews24, BBC News Bangla, and Prothom Alo.
The proposed approach achieved superior results in word prediction, reaching 99% accuracy for both 4-gram and 5-gram word predictions.
arXiv Detail & Related papers (2024-05-03T06:06:01Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language, and popular NLP models fail to perform well on it.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on
Self-Chat Data [101.63682141248069]
Chat models, such as ChatGPT, have shown impressive capabilities and have been rapidly adopted across numerous domains.
We propose a pipeline that can automatically generate a high-quality multi-turn chat corpus by leveraging ChatGPT.
We employ parameter-efficient tuning to enhance LLaMA, an open-source large language model.
arXiv Detail & Related papers (2023-04-03T17:59:09Z) - Incongruity Detection between Bangla News Headline and Body Content
through Graph Neural Network [0.0]
Incongruity between news headlines and body content is a common method of deception used to attract readers.
We propose a graph-based hierarchical dual encoder model that learns the content similarity and contradiction between Bangla news headlines and content paragraphs effectively.
The proposed Bangla graph-based neural network model achieves above 90% accuracy on various Bangla news datasets.
arXiv Detail & Related papers (2022-10-26T20:57:45Z) - BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset [3.922582192616519]
We present BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline.
We aim to take a step towards alleviating the low resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase.
arXiv Detail & Related papers (2022-10-11T02:52:31Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - BanglaBERT: Combating Embedding Barrier for Low-Resource Language
Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB data we crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
arXiv Detail & Related papers (2021-01-01T09:28:45Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural
Language Processing [64.87699383581885]
We introduce TextBrewer, an open-source knowledge distillation toolkit for natural language processing.
It supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling.
As a case study, we use TextBrewer to distill BERT on several typical NLP tasks.
arXiv Detail & Related papers (2020-02-28T09:44:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.