End-to-End Natural Language Understanding Pipeline for Bangla
Conversational Agents
- URL: http://arxiv.org/abs/2107.05541v2
- Date: Tue, 13 Jul 2021 01:52:58 GMT
- Title: End-to-End Natural Language Understanding Pipeline for Bangla
Conversational Agents
- Authors: Fahim Shahriar Khan, Mueeze Al Mushabbir, Mohammad Sabik Irbaz, MD
Abdullah Al Nasim
- Abstract summary: We propose a novel approach to building a business assistant that can consistently communicate, with high confidence, in both Bangla and Bangla transliterated into English.
We use Rasa Open Source Framework, fastText embeddings, Polyglot embeddings, Flask, and other systems as building blocks.
We present a pipeline for intent classification and entity extraction which achieves reasonable performance.
- Score: 0.43012765978447565
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Chatbots are intelligent software built to be used as a replacement for human
interaction. However, existing studies typically do not provide enough support
for low-resource languages like Bangla. Moreover, due to the increasing
popularity of social media, we can also see the rise of interactions in Bangla
transliteration (mostly in English) among the native Bangla speakers. In this
paper, we propose a novel approach to building a Bangla chatbot intended as a
business assistant that can consistently communicate, with high confidence, in
both Bangla and Bangla transliterated into English. Since annotated
data was not available for this purpose, we had to work on the whole machine
learning life cycle (data preparation, machine learning modeling, and model
deployment) using Rasa Open Source Framework, fastText embeddings, Polyglot
embeddings, Flask, and other systems as building blocks. While working with the
skewed annotated dataset, we evaluated different setups and pipelines to
determine which works best and offer possible explanations for the observed results.
Finally, we present a pipeline for intent classification and entity extraction
which achieves reasonable performance (accuracy: 83.02%, precision: 80.82%,
recall: 83.02%, F1-score: 80%).
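The intent-classification stage described above (averaged word embeddings feeding a classifier) can be illustrated with a minimal sketch. This is not the authors' Rasa pipeline: a hash-derived toy embedder stands in for trained fastText vectors, and a nearest-centroid rule stands in for Rasa's classifier; all names (`embed_word`, `NearestCentroidIntentClassifier`, the example intents) are hypothetical.

```python
import hashlib
import math
from collections import defaultdict

DIM = 32

def embed_word(word, dim=DIM):
    # Toy stand-in for fastText: derive a fixed pseudo-random vector from a
    # hash of the word. Real fastText vectors are learned from subword n-grams.
    h = hashlib.sha256(word.encode("utf-8")).digest()
    return [(h[i % len(h)] - 128) / 128.0 for i in range(dim)]

def embed_sentence(text):
    # Average the word vectors, as embedding-based featurizers commonly do.
    vecs = [embed_word(w) for w in text.lower().split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class NearestCentroidIntentClassifier:
    """Assign an utterance to the intent whose centroid is nearest in cosine distance."""

    def fit(self, examples):  # examples: list of (text, intent) pairs
        buckets = defaultdict(list)
        for text, intent in examples:
            buckets[intent].append(embed_sentence(text))
        self.centroids = {
            intent: [sum(col) / len(col) for col in zip(*vecs)]
            for intent, vecs in buckets.items()
        }
        return self

    def predict(self, text):
        v = embed_sentence(text)
        return max(self.centroids, key=lambda i: cosine(v, self.centroids[i]))

clf = NearestCentroidIntentClassifier().fit([
    ("what is the price", "ask_price"),
    ("how much does it cost", "ask_price"),
    ("hello there", "greet"),
    ("hi good morning", "greet"),
])
print(clf.predict("what is the price"))
```

In the paper's actual setup the embedding and classification components are swapped in via the Rasa pipeline configuration rather than hand-coded as here.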
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Enhancing Bangla Language Next Word Prediction and Sentence Completion through Extended RNN with Bi-LSTM Model On N-gram Language [1.3693860189056777]
This paper introduces a Bi-LSTM model that effectively handles Bangla next-word prediction and Bangla sentence generation.
We constructed a corpus dataset from various news portals, including bdnews24, BBC News Bangla, and Prothom Alo.
The proposed approach achieved superior results in word prediction, reaching 99% accuracy for both 4-gram and 5-gram word predictions.
arXiv Detail & Related papers (2024-05-03T06:06:01Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - On Evaluation of Bangla Word Analogies [0.8658596218544772]
This paper presents a high-quality dataset for evaluating the quality of Bangla word embeddings.
Despite being the 7th most-spoken language in the world, Bangla is a low-resource language, and popular NLP models fail to perform well on it.
arXiv Detail & Related papers (2023-04-10T14:27:35Z) - Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on
Self-Chat Data [101.63682141248069]
Chat models, such as ChatGPT, have shown impressive capabilities and have been rapidly adopted across numerous domains.
We propose a pipeline that can automatically generate a high-quality multi-turn chat corpus by leveraging ChatGPT.
We employ parameter-efficient tuning to enhance LLaMA, an open-source large language model.
arXiv Detail & Related papers (2023-04-03T17:59:09Z) - Incongruity Detection between Bangla News Headline and Body Content
through Graph Neural Network [0.0]
Incongruity between news headlines and body content is a common method of deception used to attract readers.
We propose a graph-based hierarchical dual encoder model that learns the content similarity and contradiction between Bangla news headlines and content paragraphs effectively.
The proposed Bangla graph-based neural network model achieves above 90% accuracy on various Bangla news datasets.
arXiv Detail & Related papers (2022-10-26T20:57:45Z) - BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset [3.922582192616519]
We present BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline.
We aim to take a step towards alleviating the low resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase.
arXiv Detail & Related papers (2022-10-11T02:52:31Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - BanglaBERT: Combating Embedding Barrier for Low-Resource Language
Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB data we crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
arXiv Detail & Related papers (2021-01-01T09:28:45Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural
Language Processing [64.87699383581885]
We introduce TextBrewer, an open-source knowledge distillation toolkit for natural language processing.
It supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling.
As a case study, we use TextBrewer to distill BERT on several typical NLP tasks.
arXiv Detail & Related papers (2020-02-28T09:44:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.