Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages
- URL: http://arxiv.org/abs/2502.17664v2
- Date: Sun, 09 Mar 2025 08:17:58 GMT
- Title: Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages
- Authors: Tsan Tsai Chan, Xin Tong, Thi Thu Uyen Hoang, Barbare Tepnadze, Wojciech Stempniak,
- Abstract summary: Multilingual large language models generate non-faithful output in resource-constrained languages. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. We show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33%.
- Score: 9.426642998924724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual large language models (LLMs) are known to more frequently generate non-faithful output in resource-constrained languages (Guerreiro et al., 2023 - arXiv:2303.16104), potentially because these typologically diverse languages are underrepresented in their training data. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. As proof of the feasibility of such an approach, we show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33% in three genetically unrelated languages that differ in their morphological complexity - Vietnamese, Polish and Georgian. The same hyperparameter combination moreover generalises well to three other tasks, suggesting applications for rescoring beyond improving faithfulness. In order to inform typologically aware model selection, we also investigate how morphological complexity interacts with regularisation, model depth and training objectives, ultimately demonstrating that morphologically complex languages are more likely to benefit from dropout, while across languages downstream performance is enhanced most by shallow architectures as well as training using the standard BERT objectives.
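To make the rescoring idea concrete, here is a minimal sketch that ranks candidate summaries with a small monolingual masked LM. The abstract does not specify how the auxiliary model scores candidates, so pseudo-log-likelihood scoring is an assumption here, and "path/to/small-mono-bert" is a placeholder for a 4-layer BERT pretrained from scratch on the target language.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder path: a small (e.g. 4-layer) monolingual BERT for the target language.
MODEL_PATH = "path/to/small-mono-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForMaskedLM.from_pretrained(MODEL_PATH).eval()

@torch.no_grad()
def pseudo_log_likelihood(text: str) -> float:
    """Mean log P(token | rest), masking one token at a time (PLL scoring, an assumption)."""
    ids = tokenizer(text, return_tensors="pt", truncation=True)["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):                  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)               # length-normalised score

def rescore(candidates: list[str]) -> str:
    """Return the candidate the auxiliary model scores highest."""
    return max(candidates, key=pseudo_log_likelihood)
```

In practice, the highest-scoring candidate among several LLM-generated summaries of the same source text would be the one returned to the user.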
Related papers
- Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages [0.43498389175652036]
This study integrates traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages.
We demonstrate substantial improvements in word error rate, particularly in low-resource scenarios.
While the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters.
arXiv Detail & Related papers (2025-03-30T18:03:52Z)
- Cross-lingual transfer of multilingual models on low resource African Languages [0.20793001310272596]
Monolingual models, trained on a single language, may better capture the nuances of the target language.
AfriBERT achieved the highest cross-lingual accuracy of 88.3% after fine-tuning.
BiGRU emerged as the best-performing neural model with 83.3% accuracy.
arXiv Detail & Related papers (2024-09-17T08:05:40Z)
- LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
LingOly is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z)
- Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation [21.057178077747754]
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating cross-lingual knowledge from the knowledge of query-document matching, OPTICAL only needs bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
arXiv Detail & Related papers (2023-01-29T22:30:36Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization (a generic sketch of such an objective appears after this list).
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters (a generic meta-learning sketch in this style appears after this list).
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
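As referenced in the distributionally robust MNMT entry above, the following is a minimal sketch of a robust training objective over languages, min_theta max_{lambda in simplex} sum_i lambda_i * L_i(theta). It uses a group-DRO-style exponentiated-gradient update for the language weights rather than that paper's iterated best response scheme; `model`, `loss_fn` and the per-language batches are placeholders.

```python
import torch

def dro_step(model, optimizer, loss_fn, batches_by_lang, weights, eta=0.1):
    """One robust update: up-weight languages that currently incur high loss."""
    # One mini-batch per language; stack the per-language losses.
    losses = torch.stack([loss_fn(model(x), y) for x, y in batches_by_lang])

    # Best-response-style update of the language weights on the simplex.
    with torch.no_grad():
        weights = weights * torch.exp(eta * losses)
        weights = weights / weights.sum()

    # Gradient step on the worst-case (weighted) loss w.r.t. model parameters.
    robust_loss = (weights * losses).sum()
    optimizer.zero_grad()
    robust_loss.backward()
    optimizer.step()
    return weights
```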
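As referenced in the morphological inflection entry above, here is a generic first-order MAML loop in which each language is treated as one task. It illustrates the general meta-learning recipe only, not the authors' exact algorithm; `model`, `loss_fn`, the meta-optimizer and the support/query batches are placeholders.

```python
import copy
import torch

def fomaml_outer_step(model, meta_optimizer, loss_fn, tasks, inner_lr=1e-2):
    """One first-order MAML update over a batch of language tasks."""
    meta_optimizer.zero_grad()
    for support, query in tasks:                      # one language = one task
        adapted = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)

        # Inner loop: adapt the copy to this language's support examples.
        x_s, y_s = support
        inner_opt.zero_grad()
        loss_fn(adapted(x_s), y_s).backward()
        inner_opt.step()

        # Outer gradient: evaluate the adapted copy on held-out query examples
        # and accumulate its (first-order) gradients into the shared initialisation.
        x_q, y_q = query
        grads = torch.autograd.grad(loss_fn(adapted(x_q), y_q), adapted.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g

    meta_optimizer.step()                             # update the shared initialisation
```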