Indic-Transformers: An Analysis of Transformer Language Models for
Indian Languages
- URL: http://arxiv.org/abs/2011.02323v1
- Date: Wed, 4 Nov 2020 14:43:43 GMT
- Title: Indic-Transformers: An Analysis of Transformer Language Models for
Indian Languages
- Authors: Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, Ayushman
Dash
- Abstract summary: Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks.
However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German.
Indian languages, on the other hand, are underrepresented in such benchmarks.
- Score: 0.8155575318208631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models based on the Transformer architecture have achieved
state-of-the-art performance on a wide range of NLP tasks such as text
classification, question-answering, and token classification. However, this
performance is usually tested and reported on high-resource languages, like
English, French, Spanish, and German. Indian languages, on the other hand, are
underrepresented in such benchmarks. Despite some Indian languages being
included in training multilingual Transformer models, they have not been the
primary focus of such work. In order to evaluate the performance on Indian
languages specifically, we analyze these language models through extensive
experiments on multiple downstream tasks in Hindi, Bengali, and Telugu. Here,
we compare the efficacy of fine-tuning the parameters of
pre-trained models against that of training a language model from scratch.
Moreover, we argue empirically against a strict dependency between dataset
size and model performance, and instead encourage task-specific model and
method selection. We achieve state-of-the-art performance on Hindi and
Bengali for the text classification task. Finally, we present effective
strategies for modeling Indian languages, and we release our model
checkpoints for the community at https://huggingface.co/neuralspace-reverie.
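
The released checkpoints can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch of fine-tuning one of them for Hindi text classification; the model identifier shown (neuralspace-reverie/indic-transformers-hi-bert) and the toy examples are assumptions for illustration, so the exact checkpoint name should be taken from the Hub page linked above and the toy data replaced with a real labelled corpus.

```python
# Minimal sketch: fine-tune a released Indic-Transformers checkpoint for
# Hindi text classification. The model ID below is an assumption; check
# https://huggingface.co/neuralspace-reverie for the exact checkpoint names.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "neuralspace-reverie/indic-transformers-hi-bert"  # assumed name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy labelled examples (Hindi sentences with binary sentiment labels), for illustration only.
texts = ["यह फिल्म बहुत अच्छी थी", "सेवा बहुत खराब थी"]
labels = [1, 0]

encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
dataset = TensorDataset(
    encodings["input_ids"], encodings["attention_mask"], torch.tensor(labels)
)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a real run would use a full dataset and a validation split
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        outputs.loss.backward()
        optimizer.step()
```

The same pattern applies to the Bengali and Telugu checkpoints by swapping the model identifier.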
Related papers
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
- Lifting the Curse of Multilinguality by Pre-training Modular Transformers [72.46919537293068]
Multilingual pre-trained models suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages.
We introduce language-specific modules, which allow us to grow the total capacity of the model while keeping the number of trainable parameters per language constant (a minimal sketch of this idea appears after the list).
Our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model's usage to the set of pre-trained languages.
arXiv Detail & Related papers (2022-05-12T17:59:56Z)
- Cross-Lingual Fine-Grained Entity Typing [26.973783464706447]
We present a unified cross-lingual fine-grained entity typing model capable of handling over 100 languages.
We analyze this model's ability to generalize to languages and entities unseen during training.
arXiv Detail & Related papers (2021-10-15T03:22:30Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- HinFlair: pre-trained contextual string embeddings for pos tagging and text classification in the Hindi language [0.0]
HinFlair is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus.
Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings on downstream tasks like text classification and POS tagging.
arXiv Detail & Related papers (2021-01-18T09:23:35Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
To our great astonishment, we find that pre-training on certain non-human-language data yields GLUE performance close to that obtained by pre-training on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary (typically selected before training and permanently fixed later) affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Mono vs Multilingual Transformer-based Models: a Comparison across Several Language Tasks [1.2691047660244335]
BERT (Bidirectional Encoder Representations from Transformers) and ALBERT (A Lite BERT) are methods for pre-training language models.
We make our trained BERT and ALBERT models for Portuguese available.
arXiv Detail & Related papers (2020-07-19T19:13:20Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including both existing ones and newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
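
As referenced in the "Lifting the Curse of Multilinguality by Pre-training Modular Transformers" entry above, the sketch below illustrates the general idea of language-specific modules in PyTorch: a shared backbone plus one small bottleneck module per language, so adding a language only adds that module's parameters. The dimensions and module design here are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of the "language-specific module" idea: a small per-language
# bottleneck applied on top of a shared Transformer layer output.
# Dimensions and module design are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageModule(nn.Module):
    """Bottleneck module owned by a single language."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the shared representation intact.
        return self.norm(hidden_states + self.up(torch.relu(self.down(hidden_states))))

class ModularLayer(nn.Module):
    """A shared layer plus one module per language; a batch in a given
    language is routed through only that language's module."""
    def __init__(self, shared_layer: nn.Module, languages):
        super().__init__()
        self.shared = shared_layer
        self.lang_modules = nn.ModuleDict({lang: LanguageModule() for lang in languages})

    def forward(self, hidden_states: torch.Tensor, lang: str) -> torch.Tensor:
        return self.lang_modules[lang](self.shared(hidden_states))

# Adding a new language later only adds one LanguageModule; the shared
# backbone and the other languages' modules are untouched.
layer = ModularLayer(nn.Linear(768, 768), languages=["hi", "bn", "te"])
layer.lang_modules["ta"] = LanguageModule()  # post-hoc addition
out = layer(torch.randn(2, 16, 768), lang="hi")
```

Because each batch routes through only its own language's module, the per-language parameter budget stays constant as more languages are added, which is the property the entry above describes.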
This list is automatically generated from the titles and abstracts of the papers on this site.