Extractive Question Answering on Queries in Hindi and Tamil
- URL: http://arxiv.org/abs/2210.06356v1
- Date: Tue, 27 Sep 2022 00:40:21 GMT
- Title: Extractive Question Answering on Queries in Hindi and Tamil
- Authors: Adhitya Thirumala, Elisa Ferracane
- Abstract summary: Indic languages like Hindi and Tamil are underrepresented in the natural language processing (NLP) field compared to languages like English.
The goal of this project is to build an NLP model that performs better than pre-existing models for the task of extractive question-answering (QA) on a public dataset in Hindi and Tamil.
- Score: 2.66512000865131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Indic languages like Hindi and Tamil are underrepresented in the natural
language processing (NLP) field compared to languages like English. Due to this
underrepresentation, performance on NLP tasks (such as search algorithms) in
Indic languages is inferior to that in English. This difference
disproportionately affects those who come from lower socioeconomic statuses
because they consume the most Internet content in local languages. The goal of
this project is to build an NLP model that performs better than pre-existing
models for the task of extractive question-answering (QA) on a public dataset
in Hindi and Tamil. Extractive QA is an NLP task where answers to questions are
extracted from a corresponding body of text. To build the best solution, we
used three different models. The first model is an unmodified cross-lingual
version of the NLP model RoBERTa, known as XLM-RoBERTa, which is pretrained on
100 languages. The second model is based on the pretrained RoBERTa model with an
extra classification head for question answering; here we used a custom Indic
tokenizer, optimized hyperparameters, and fine-tuned on the Indic dataset. The
third model is based on XLM-RoBERTa, with additional fine-tuning and training on
the Indic dataset. We hypothesized that the third model would perform best
because of the variety of languages XLM-RoBERTa has been pretrained on and the
additional fine-tuning on the Indic dataset. This hypothesis was proven wrong:
the RoBERTa-based models performed the best because their training data was the
most specific to the task, whereas much of the XLM-RoBERTa models' pretraining
data was in neither Hindi nor Tamil.
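To make the task concrete, below is a minimal sketch of extractive QA span prediction with an XLM-RoBERTa question-answering head, assuming the Hugging Face transformers library; the checkpoint name and the Hindi question/context are illustrative assumptions, not the authors' exact models or data.

```python
# Minimal sketch of extractive QA with an XLM-RoBERTa question-answering head.
# Assumptions: Hugging Face transformers and a public multilingual QA checkpoint;
# the Hindi example is illustrative, not taken from the paper's dataset.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "deepset/xlm-roberta-base-squad2"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "ताजमहल कहाँ स्थित है?"  # "Where is the Taj Mahal located?"
context = "ताजमहल भारत के आगरा शहर में यमुना नदी के किनारे स्थित है।"

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The QA head scores every token as a possible answer start or end; the answer
# is the highest-scoring span, decoded back to text. (A full implementation
# would restrict the span to context tokens and cap its length.)
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits))
answer_ids = inputs["input_ids"][0, start : end + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```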
Related papers
- HindiLLM: Large Language Model for Hindi [0.09363323206192666]
We have pre-trained two autoregressive Large Language Models (LLMs) for the Hindi language.
We use a two-step process comprising unsupervised pre-training and supervised fine-tuning.
The evaluation shows that the HindiLLM-based fine-tuned models outperform several models in most of the language-related tasks.
arXiv Detail & Related papers (2024-12-29T05:28:15Z)
- Table Question Answering for Low-resourced Indic Languages [71.57359949962678]
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output.
We introduce a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget.
We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
arXiv Detail & Related papers (2024-10-04T16:26:12Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- A Comparative Study of Transformer-Based Language Models on Extractive Question Answering [0.5079811885340514]
We train various pre-trained language models and fine-tune them on multiple question answering datasets.
Using the F1-score as our metric (see the token-overlap F1 sketch after this list), we find that the RoBERTa and BART pre-trained models perform the best across all datasets.
arXiv Detail & Related papers (2021-10-07T02:23:19Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in a resource-rich language, e.g., English, to other languages that are less rich in resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) [25.696099563130517]
We introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set).
MSGS consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning.
We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base.
We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones.
arXiv Detail & Related papers (2020-10-11T22:09:27Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It demonstrates state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
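As referenced in the extractive QA comparison entry above, the following is a minimal, hypothetical sketch of SQuAD-style token-overlap F1, the usual metric for scoring a predicted answer span against a gold answer; full implementations also normalize case, punctuation, and articles before comparing tokens.

```python
# Hypothetical sketch of SQuAD-style token-overlap F1 between a predicted and a
# gold answer span. Real metrics normalize case, punctuation, and articles first.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in Agra India", "Agra India"))  # 0.8: partial credit for overlap
```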