Finding the Optimal Vocabulary Size for Neural Machine Translation
- URL: http://arxiv.org/abs/2004.02334v2
- Date: Mon, 5 Oct 2020 15:19:16 GMT
- Title: Finding the Optimal Vocabulary Size for Neural Machine Translation
- Authors: Thamme Gowda, Jonathan May
- Abstract summary: We cast neural machine translation (NMT) as a classification task in an autoregressive setting.
We analyze the limitations of both classification and autoregression components.
We reveal an explanation for why certain vocabulary sizes are better than others.
- Score: 25.38870582223696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We cast neural machine translation (NMT) as a classification task in an
autoregressive setting and analyze the limitations of both classification and
autoregression components. Classifiers are known to perform better with
balanced class distributions during training. Since the Zipfian nature of
languages causes imbalanced classes, we explore its effect on NMT. We analyze
the effect of various vocabulary sizes on NMT performance across multiple
languages and a range of data sizes, and reveal an explanation for why certain
vocabulary sizes are better than others.
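To make the class-imbalance argument concrete, here is a toy sketch (an illustration under assumed settings, not code or a metric from the paper): it draws an idealized Zipfian token distribution and reports how imbalanced the resulting output classes become as the vocabulary grows. The vocabulary sizes, the Zipf exponent, and the two imbalance measures (normalized entropy and the probability mass left to the rarer half of the classes) are assumptions chosen purely for illustration.

```python
# Toy illustration (not from the paper): a Zipfian token distribution becomes
# more imbalanced as the vocabulary, i.e. the number of output classes, grows.
# Vocab sizes, Zipf exponent, and imbalance measures are illustrative assumptions.
import numpy as np

def zipf_class_stats(vocab_size: int, s: float = 1.0):
    """Return (normalized entropy, mass of rarer half) for a Zipf(s) distribution."""
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    p = ranks ** (-s)                             # p_i proportional to 1 / rank^s
    p /= p.sum()
    entropy = -(p * np.log(p)).sum()
    norm_entropy = entropy / np.log(vocab_size)   # 1.0 would be perfectly balanced
    rare_mass = p[vocab_size // 2:].sum()         # mass held by the rarer half of classes
    return norm_entropy, rare_mass

for v in (1_000, 8_000, 32_000, 64_000):
    h, tail = zipf_class_stats(v)
    print(f"vocab={v:>6}  normalized_entropy={h:.3f}  mass_in_rarer_half={tail:.4f}")
```

Under this toy model, normalized entropy falls and the rarer half of the classes holds an ever smaller share of the probability mass as the vocabulary grows, which is the kind of imbalance the abstract attributes to the Zipfian nature of language.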
Related papers
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text
Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource
Languages [1.8787713898828164]
We present a detailed analysis of the effects of dictionary quality, training dataset size, language family, etc., on translation quality.
Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.
arXiv Detail & Related papers (2022-06-09T12:03:29Z)
- How Robust is Neural Machine Translation to Language Imbalance in
Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while relatively better performance is often observed when languages are more equally sampled, downstream performance is more robust to language imbalance than is usually expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Language Modeling, Lexical Translation, Reordering: The Training Process
of NMT through the Lens of Classical SMT [64.1841519527504]
Neural machine translation uses a single neural network to model the entire translation process.
Despite neural machine translation being the de facto standard, it is still not clear how NMT models acquire different competences over the course of training.
arXiv Detail & Related papers (2021-09-03T09:38:50Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Learning Feature Weights using Reward Modeling for Denoising Parallel
Corpora [36.292020779233056]
This work presents an alternative approach that learns weights for multiple sentence-level features.
We apply this technique to building Neural Machine Translation (NMT) systems using the Paracrawl corpus for Estonian-English.
We analyze the sensitivity of this method to different types of noise and explore if the learned weights generalize to other language pairs.
arXiv Detail & Related papers (2021-03-11T21:45:45Z)
- Linguistic Profiling of a Neural Language Model [1.0552465253379135]
We investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process.
We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks.
arXiv Detail & Related papers (2020-10-05T09:09:01Z)
- Balancing Training for Multilingual Neural Machine Translation [130.54253367251738]
Multilingual machine translation (MT) models can translate to/from multiple languages.
Standard practice is to up-sample less-resourced languages to increase their representation (a common temperature-based form of this heuristic is sketched after this list).
We propose a method that instead automatically learns how to weight training data through a data scorer.
arXiv Detail & Related papers (2020-04-14T18:23:28Z)
- Morphological Word Segmentation on Agglutinative Languages for Neural
Machine Translation [8.87546236839959]
We propose a morphological word segmentation method on the source side for neural machine translation (NMT).
It incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
arXiv Detail & Related papers (2020-01-02T10:05:02Z)
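As a companion to the "Balancing Training for Multilingual Neural Machine Translation" entry above, the sketch below shows the common temperature-based up-sampling heuristic that the entry calls standard practice; it is not that paper's learned data scorer, and the language pairs and corpus sizes are hypothetical.

```python
# Illustrative sketch of temperature-based up-sampling (the common baseline,
# not the learned data scorer proposed in the paper).
# Language pairs and corpus sizes below are hypothetical.
corpus_sizes = {"en-de": 4_500_000, "en-hi": 300_000, "en-gu": 20_000}

def sampling_probs(sizes: dict, temperature: float = 5.0) -> dict:
    """Sample each language pair with probability proportional to size**(1/T)."""
    weights = {pair: n ** (1.0 / temperature) for pair, n in sizes.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

print(sampling_probs(corpus_sizes, temperature=1.0))  # mirrors raw data sizes
print(sampling_probs(corpus_sizes, temperature=5.0))  # flatter: low-resource pairs up-sampled
```

At temperature 1.0 the sampling distribution simply mirrors the raw corpus sizes; higher temperatures flatten it, up-sampling the lower-resource pairs at the expense of the largest one.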
This list is automatically generated from the titles and abstracts of the papers listed on this site.