Trankit: A Light-Weight Transformer-based Toolkit for Multilingual
Natural Language Processing
- URL: http://arxiv.org/abs/2101.03289v4
- Date: Thu, 11 Mar 2021 04:27:46 GMT
- Title: Trankit: A Light-Weight Transformer-based Toolkit for Multilingual
Natural Language Processing
- Authors: Minh Van Nguyen, Viet Lai, Amir Pouran Ben Veyseh, and Thien Huu
Nguyen
- Abstract summary: Trankit is a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP).
It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages.
Trankit significantly outperforms prior multilingual NLP pipelines over sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing.
- Score: 22.38792093462942
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Trankit, a light-weight Transformer-based Toolkit for
multilingual Natural Language Processing (NLP). It provides a trainable
pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained
pipelines for 56 languages. Built on a state-of-the-art pretrained language
model, Trankit significantly outperforms prior multilingual NLP pipelines over
sentence segmentation, part-of-speech tagging, morphological feature tagging,
and dependency parsing while maintaining competitive performance for
tokenization, multi-word token expansion, and lemmatization over 90 Universal
Dependencies treebanks. Despite the use of a large pretrained transformer, our
toolkit is still efficient in memory usage and speed. This is achieved by our
novel plug-and-play mechanism with Adapters where a multilingual pretrained
transformer is shared across pipelines for different languages. Our toolkit
along with pretrained models and code are publicly available at:
https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also
available at: http://nlp.uoregon.edu/trankit. Finally, we create a demo video
for Trankit at: https://youtu.be/q0KGP3zGjGc.
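The plug-and-play design is visible in the toolkit's user-facing API: a single pipeline object wraps the shared multilingual transformer, and additional languages are attached as small adapter components rather than loaded as separate full models. Below is a minimal usage sketch in Python, following the usage documented in the linked repository; the exact constructor arguments (e.g. gpu, cache_dir) and method names should be verified against the current Trankit release.

```python
from trankit import Pipeline

# Initialize a pipeline for English; the multilingual transformer backbone is
# downloaded once and cached locally.
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

# Run the full pipeline: sentence segmentation, tokenization, multi-word token
# expansion, POS and morphological tagging, lemmatization, dependency parsing.
doc = p('Trankit is a light-weight multilingual NLP toolkit. It supports over 100 languages.')

# Individual components can also be called on their own, e.g. sentence splitting.
sentences = p.ssplit('Hello! This is Trankit.')

# Plug-and-play multilingual support: adding a language attaches adapter weights
# on top of the shared transformer instead of loading a second full model.
p.add('vietnamese')
p.set_active('vietnamese')
vi_doc = p('Xin chào! Đây là Trankit.')
```

Because the transformer backbone is shared across languages, adding a language grows memory mainly by the adapter and task-head parameters, which is how the abstract's memory-efficiency claim is achieved.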
Related papers
- Hindi to English: Transformer-Based Neural Machine Translation [0.0]
We have developed a Neural Machine Translation (NMT) system by training the Transformer model to translate texts from the Indian language Hindi to English.
We implemented back-translation to augment the training data and to create the vocabulary.
This led us to achieve a state-of-the-art BLEU score of 24.53 on the test set of IIT Bombay English-Hindi Corpus.
arXiv Detail & Related papers (2023-09-23T00:00:09Z)
- The VolcTrans System for WMT22 Multilingual Machine Translation Task [24.300726424411007]
VolcTrans is a transformer-based multilingual model trained on data from multiple sources.
A series of rules is applied to clean both bilingual and monolingual texts.
Our system achieves 17.3 BLEU, 21.9 spBLEU, and 41.9 chrF2++ on average over all language pairs.
arXiv Detail & Related papers (2022-10-20T21:18:03Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
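For readers unfamiliar with adapters, the sketch below shows the standard bottleneck-adapter layout that this line of work (including the Trankit toolkit above) builds on: a small down-projection/up-projection block with a residual connection, inserted into a frozen pretrained transformer. This is a generic PyTorch illustration, not code from either paper; the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic adapter block: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen backbone's representation;
        # only the small down/up projections are trained per language (family).
        return hidden + self.up(self.act(self.down(hidden)))

# Example: adapt a frozen transformer layer's output for one language family.
adapter = BottleneckAdapter(d_model=1024, bottleneck=64)
layer_output = torch.randn(2, 16, 1024)  # (batch, sequence, hidden) from a frozen layer
adapted = adapter(layer_output)
```

Only the adapter parameters are updated, which is why this is far cheaper than fine-tuning the entire multilingual model.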
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach consists of replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
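A minimal PyTorch-style sketch of this idea (illustrative only, not the paper's code; the toy model, attribute names, and sizes are assumptions): freeze every original parameter, swap in a small language-specific embedding table, and train only those new embeddings.

```python
import torch
import torch.nn as nn

class ToyNMTModel(nn.Module):
    """Stand-in for a pretrained multilingual encoder-decoder (illustrative)."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 256):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)  # shared vocabulary embeddings
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.generator = nn.Linear(d_model, vocab_size)

model = ToyNMTModel()  # imagine this loaded from a multilingual checkpoint

# 1) Freeze all original parameters so performance on the initial languages
#    cannot degrade.
for param in model.parameters():
    param.requires_grad = False

# 2) Replace the shared embeddings with a small language-specific table for the
#    newly added language (the new module's weights require gradients by default).
#    A real system would also adjust the output projection for the target side.
new_vocab_size = 8000  # assumed vocabulary size for the new language
model.embed = nn.Embedding(new_vocab_size, model.d_model)

# 3) Train only the new embeddings on the new language's parallel data.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```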
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- Lightweight Adapter Tuning for Multilingual Speech Translation [47.89784337058167]
Adapter modules were recently introduced as an efficient alternative to fine-tuning in NLP.
This paper presents a comprehensive analysis of adapters for multilingual speech translation.
arXiv Detail & Related papers (2021-06-02T20:51:42Z)
- XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose the Glancing Transformer (GLAT), a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translation with 8-15 times speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)