Character-level NMT and language similarity
- URL: http://arxiv.org/abs/2308.04398v1
- Date: Tue, 8 Aug 2023 17:01:42 GMT
- Title: Character-level NMT and language similarity
- Authors: Josef Jon and Ondřej Bojar
- Abstract summary: We explore the effectiveness of character-level neural machine translation for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.
We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation.
We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.
- Score: 1.90365714903665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the effectiveness of character-level neural machine translation
using Transformer architecture for various levels of language similarity and
size of the training dataset on translation between Czech and Croatian, German,
Hungarian, Slovak, and Spanish. We evaluate the models using automatic MT
metrics and show that translation between similar languages benefits from
character-level input segmentation, while for less related languages,
character-level vanilla Transformer-base often lags behind subword-level
segmentation. We confirm previous findings that it is possible to close the gap
by finetuning the already trained subword-level models to character-level.
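The contrast at the heart of the paper is how the input is segmented before it reaches the Transformer. A minimal sketch of the two segmentation regimes is below; the toy subword vocabulary and the greedy longest-match segmenter are hypothetical stand-ins chosen only to illustrate the contrast, since real systems learn their subword merges from training data (e.g. with BPE or SentencePiece).

```python
def char_segment(sentence):
    # Character-level: every character is a token; spaces become an
    # explicit boundary marker so the model can recover word boundaries.
    return [c if c != " " else "▁" for c in sentence]

def subword_segment(sentence, vocab):
    # Greedy longest-match subword segmentation over a toy vocabulary.
    # Unknown spans fall back to single characters.
    tokens = []
    for word in sentence.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens

toy_vocab = {"stroj", "ový", "pre", "klad"}  # hypothetical learned subwords
sentence = "strojový preklad"  # "machine translation" in Slovak

print(char_segment(sentence))
# ['s', 't', 'r', 'o', 'j', 'o', 'v', 'ý', '▁', 'p', 'r', 'e', 'k', 'l', 'a', 'd']
print(subword_segment(sentence, toy_vocab))
# ['stroj', 'ový', 'pre', 'klad']
```

Character-level inputs yield much longer sequences (16 tokens versus 4 here), which is the efficiency cost the paper weighs against the gains observed for closely related language pairs.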
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
- Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation [9.736284584478032]
We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited.
While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5.
We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.
arXiv Detail & Related papers (2023-02-28T00:50:19Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Character-level Transformer-based Neural Machine Translation [5.699756532377753]
We discuss a novel Transformer-based approach, which we compare in both speed and quality to the Transformer at the subword and character levels.
We evaluate our models on 4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN.
The proposed novel architecture can be trained on a single GPU and is 34% faster than the character-level Transformer.
arXiv Detail & Related papers (2020-05-22T15:40:43Z)
- Character-Level Translation with Self-attention [9.864260997723974]
We explore the suitability of self-attention models for character-level neural machine translation.
We test the standard transformer model and a novel variant in which the encoder block combines information from nearby characters using convolutions.
Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments.
arXiv Detail & Related papers (2020-04-30T14:05:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.