Character-level NMT and language similarity
- URL: http://arxiv.org/abs/2308.04398v1
- Date: Tue, 8 Aug 2023 17:01:42 GMT
- Title: Character-level NMT and language similarity
- Authors: Josef Jon and Ondřej Bojar
- Abstract summary: We explore the effectiveness of character-level neural machine translation for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish.
We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation.
We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.
- Score: 1.90365714903665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the effectiveness of character-level neural machine translation
using Transformer architecture for various levels of language similarity and
size of the training dataset on translation between Czech and Croatian, German,
Hungarian, Slovak, and Spanish. We evaluate the models using automatic MT
metrics and show that translation between similar languages benefits from
character-level input segmentation, while for less related languages,
character-level vanilla Transformer-base often lags behind subword-level
segmentation. We confirm previous findings that it is possible to close the gap
by finetuning the already trained subword-level models to character-level.
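The contrast at the heart of the paper is how the input is segmented before it reaches the Transformer. A minimal sketch of the two segmentation regimes is below; the toy subword vocabulary and the greedy longest-match segmenter are hypothetical stand-ins chosen only to illustrate the contrast, since real systems learn their subword merges from training data (e.g. with BPE or SentencePiece).

```python
def char_segment(sentence):
    # Character-level: every character is a token; spaces become an
    # explicit boundary marker so the model can recover word boundaries.
    return [c if c != " " else "▁" for c in sentence]

def subword_segment(sentence, vocab):
    # Greedy longest-match subword segmentation over a toy vocabulary.
    # Unknown spans fall back to single characters.
    tokens = []
    for word in sentence.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens

toy_vocab = {"stroj", "ový", "pre", "klad"}  # hypothetical learned subwords
sentence = "strojový preklad"  # "machine translation" in Slovak

print(char_segment(sentence))
# ['s', 't', 'r', 'o', 'j', 'o', 'v', 'ý', '▁', 'p', 'r', 'e', 'k', 'l', 'a', 'd']
print(subword_segment(sentence, toy_vocab))
# ['stroj', 'ový', 'pre', 'klad']
```

Character-level inputs yield much longer sequences (16 tokens versus 4 here), which is the efficiency cost the paper weighs against the gains observed for closely related language pairs.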
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
- Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation [9.736284584478032]
We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited.
While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5.
We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.
arXiv Detail & Related papers (2023-02-28T00:50:19Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Character-level Transformer-based Neural Machine Translation [5.699756532377753]
We discuss a novel Transformer-based approach, which we compare in both speed and quality to the Transformer at the subword and character levels.
We evaluate our models on 4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN.
The proposed novel architecture can be trained on a single GPU and is 34% faster than the character-level Transformer.
arXiv Detail & Related papers (2020-05-22T15:40:43Z)
- Character-Level Translation with Self-attention [9.864260997723974]
We explore the suitability of self-attention models for character-level neural machine translation.
We test the standard transformer model and a novel variant in which the encoder block combines information from nearby characters using convolutions.
Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments.
arXiv Detail & Related papers (2020-04-30T14:05:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.