Self-tuning hyper-parameters for unsupervised cross-lingual tokenization
- URL: http://arxiv.org/abs/2303.02427v2
- Date: Tue, 4 Apr 2023 15:34:50 GMT
- Title: Self-tuning hyper-parameters for unsupervised cross-lingual tokenization
- Authors: Anton Kolonin
- Abstract summary: We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model.
We find a fairly good correlation between the conventional F1 tokenization score and the additive combination of the three proposed metrics for English and Russian.
In the case of Chinese, we find a significant correlation between the F1 score and the compression factor.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the possibility of meta-learning for the language-independent
unsupervised tokenization problem for English, Russian, and Chinese. We
implement the meta-learning approach for automatic determination of
hyper-parameters of the unsupervised tokenization model proposed in earlier
works, relying on various human-independent fitness functions such as
normalised anti-entropy, compression factor and cross-split F1 score, as well
as additive and multiplicative composite combinations of the three metrics,
testing them against the conventional F1 tokenization score. We find a fairly
good correlation between the latter and the additive combination of the former
three metrics for English and Russian. In the case of Chinese, we find a
significant correlation between the F1 score and the compression factor. Our
results suggest the possibility of robust unsupervised tokenization of
low-resource and dead languages and allow us to think about human languages in
terms of the evolution of efficient symbolic communication codes with different
structural optimisation schemes that have evolved in different human cultures.
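As a rough illustration of how such human-independent fitness functions might look in code, here is a minimal sketch; the exact formulas are assumptions reconstructed from the metric names (the abstract does not spell them out), and `tokenize` stands in for any parameterised unsupervised tokenizer:

```python
import math
from collections import Counter

def normalised_anti_entropy(tokens):
    # Assumed reading: 1 minus the Shannon entropy of the token
    # frequency distribution, normalised by its maximum, log(|V|).
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return 1.0 - entropy / max_entropy

def compression_factor(text, tokens):
    # Characters of raw text per emitted token: higher values mean
    # the tokenization compresses the input more.
    return len(text) / max(len(tokens), 1)

def cross_split_f1(tokens_a, tokens_b):
    # F1 overlap between the vocabularies induced on two disjoint
    # corpus splits; a stable tokenizer rediscovers the same units.
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    overlap = len(vocab_a & vocab_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(vocab_a)
    recall = overlap / len(vocab_b)
    return 2 * precision * recall / (precision + recall)

def additive_fitness(text_a, text_b, tokenize):
    # Additive composite of the three metrics; their relative scaling
    # is a free choice here, not prescribed by the abstract.
    tok_a, tok_b = tokenize(text_a), tokenize(text_b)
    return (normalised_anti_entropy(tok_a + tok_b)
            + compression_factor(text_a + text_b, tok_a + tok_b)
            + cross_split_f1(tok_a, tok_b))

print(additive_fitness("the cat sat on the mat", "a dog ran", lambda s: s.split()))
```

A meta-learning loop would then sweep the tokenizer's hyper-parameters and keep the setting that maximises this composite, without ever consulting a gold segmentation.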
Related papers
- QU-NLP at CheckThat! 2025: Multilingual Subjectivity in News Articles Detection using Feature-Augmented Transformer Models with Sequential Cross-Lingual Fine-Tuning [0.21756081703275998]
This paper presents our approach to the CheckThat! 2025 Task 1 on subjectivity detection. We propose a feature-augmented transformer architecture that combines contextual embeddings from pre-trained language models with statistical and linguistic features. We evaluated our system in monolingual, multilingual, and zero-shot settings across multiple languages, including English, Arabic, German, Italian, and several unseen languages.
arXiv Detail & Related papers (2025-07-01T13:39:59Z) - LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization [4.274520108617021]
LAGO is a novel approach for few-shot cross-lingual embedding inversion attacks. It explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. Experiments show it substantially improves the transferability of attacks, with a 10-20% increase in ROUGE-L score over baselines.
arXiv Detail & Related papers (2025-05-21T20:48:24Z) - FUSE: A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages [2.377892000761193]
This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation.
We introduce the Feature-Union Scorer (FUSE) for evaluation; FUSE integrates Ridge regression and Gradient Boosting to model translation quality.
Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments.
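As a hedged sketch of that general recipe, the following blends a Ridge regressor with gradient boosting and reports Pearson/Spearman agreement; the features, blend weight, and data are synthetic placeholders, not the released FUSE system:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Synthetic stand-in: each row = features of one (hypothesis, reference)
# pair; y = a human quality judgment the metric should track.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(200)

ridge = Ridge(alpha=1.0).fit(X, y)
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Uniform blend of the two regressors; a real system would tune the
# blend weight on held-out data.
pred = 0.5 * ridge.predict(X) + 0.5 * gbr.predict(X)

print("Pearson:  %.3f" % pearsonr(pred, y)[0])
print("Spearman: %.3f" % spearmanr(pred, y)[0])
```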
arXiv Detail & Related papers (2025-03-28T06:58:55Z) - Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models [53.38288894305388]
Multilingual large language models (MLLMs) can use in-context learning (ICL) to achieve high performance by leveraging cross-lingual knowledge transfer without parameter updates.
Three key factors influence multilingual ICL: (1) semantic similarity, (2) linguistic alignment, and (3) language-specific performance.
We propose balanced multi-factor ICL (BMF-ICL), a method that quantifies and optimally balances these factors for improved example selection.
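A toy sketch of the balancing idea follows; the factor scores and weights are invented placeholders, since the summary does not specify how each factor is quantified:

```python
import numpy as np

def select_icl_examples(query_emb, cand_embs, align_scores, lang_perf,
                        weights=(0.5, 0.3, 0.2), k=4):
    # Factor 1: semantic similarity (cosine) between query and candidates.
    sem = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb))
    # Factors 2 and 3: precomputed linguistic-alignment and per-language
    # performance scores, assumed already scaled to [0, 1].
    score = weights[0] * sem + weights[1] * align_scores + weights[2] * lang_perf
    return np.argsort(score)[::-1][:k]  # indices of the k best examples

rng = np.random.default_rng(1)
print(select_icl_examples(rng.random(16), rng.random((100, 16)),
                          rng.random(100), rng.random(100)))
```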
arXiv Detail & Related papers (2025-02-17T06:56:33Z) - Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss [9.807885676930308]
We propose an approach to model idiomaticity using a triplet loss that incorporates the asymmetric contribution of component words to the idiomatic meaning for training language models.
Our proposed method is evaluated on a SemEval challenge and significantly outperforms previous alternatives on many metrics.
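One plausible shape for such a loss, sketched with an assumed fixed component weighting (the paper's adaptive scheme is not detailed in this summary):

```python
import numpy as np

def component_weighted_triplet(idiom_emb, comp_words, negative,
                               comp_weights=(0.8, 0.2), margin=0.2):
    # Assumed asymmetry: the two component words contribute unequally
    # (comp_weights) to the representation compared against the idiom's
    # contextual embedding; a standard triplet hinge follows.
    positive = comp_weights[0] * comp_words[0] + comp_weights[1] * comp_words[1]
    d_pos = np.linalg.norm(idiom_emb - positive)
    d_neg = np.linalg.norm(idiom_emb - negative)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(2)
print(component_weighted_triplet(rng.random(8), rng.random((2, 8)), rng.random(8)))
```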
arXiv Detail & Related papers (2024-06-21T14:21:41Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - Evolution of Efficient Symbolic Communication Codes [0.0]
The paper explores how the human natural language structure can be seen as a product of evolution of inter-personal communication code.
It aims to maximise culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor, and cross-split F1 score.
arXiv Detail & Related papers (2023-06-04T15:33:16Z) - VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
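A compact sketch of one contrastive term in this style (InfoNCE-like; the temperature, similarity function, and batch construction are assumptions):

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.1):
    # Contrastive term: raise similarity to the parallel/synonymous item
    # (positive), push down similarity to unpaired items (negatives).
    def sim(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(sim(query, positive) / temperature)
    neg = np.exp([sim(query, n) / temperature for n in negatives]).sum()
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(3)
seq = rng.random(32)  # embedding of one side of a sentence pair
# Sequence-to-sequence term: one parallel sentence vs. four unpaired ones.
print(info_nce(seq, rng.random(32), rng.random((4, 32))))
```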
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - On the Relation between Syntactic Divergence and Zero-Shot Performance [22.195133438732633]
We take the transfer of Universal Dependencies (UD) parsing from English to a diverse set of languages and conduct two sets of experiments.
We analyze zero-shot performance based on the extent to which English source edges are preserved in translation.
In both sets of experiments, our results suggest a strong relation between cross-lingual stability and zero-shot parsing performance.
arXiv Detail & Related papers (2021-10-09T21:09:21Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
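A toy rendering of that iterated best-response loop (the weight update below is a softmax surrogate, and the "model update" is a synthetic loss decay, not the paper's actual training step):

```python
import numpy as np

def best_response_weights(lang_losses, step=2.0):
    # Adversary step: exponentially up-weight the worst languages
    # (a softmax surrogate for the paper's exact best response).
    w = np.exp(step * lang_losses)
    return w / w.sum()

losses = np.array([1.2, 0.4, 2.0])  # stand-in per-language dev losses
for t in range(5):
    weights = best_response_weights(losses)
    # Toy "model update": training reduces each language's loss in
    # proportion to the weight the adversary placed on it.
    losses = losses * (1.0 - 0.3 * weights)
    print(t, weights.round(3), float(weights @ losses))
```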
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - A Modest Pareto Optimisation Analysis of Dependency Parsers in 2021 [0.38073142980733]
We evaluate three leading dependency parsers from different paradigms on a small yet diverse subset of languages.
As we are interested in efficiency, we evaluate the parsers without pretrained language models.
Biaffine parsing emerges as a well-balanced default choice.
arXiv Detail & Related papers (2021-06-08T09:55:47Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
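A minimal sketch of self-attention biased by syntactic distance; the additive penalty is an assumed form, not necessarily GATE's exact mechanism:

```python
import numpy as np

def syntax_biased_attention(X, dist, beta=0.5):
    # Scaled dot-product self-attention whose scores are penalised by
    # pairwise dependency-tree distance, so syntactically distant words
    # still interact, just more weakly.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d) - beta * dist
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(4)
X = rng.random((5, 16))                 # 5 words, 16-dim hidden states
dist = rng.integers(0, 4, size=(5, 5))  # toy tree distances
print(syntax_biased_attention(X, dist).shape)
```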
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z) - Modeling Voting for System Combination in Machine Translation [92.09572642019145]
We propose an approach to modeling voting for system combination in machine translation.
Our approach combines the advantages of statistical and neural methods since it can not only analyze the relations between hypotheses but also allow for end-to-end training.
arXiv Detail & Related papers (2020-07-14T09:59:38Z)