On Using Distribution-Based Compositionality Assessment to Evaluate
Compositional Generalisation in Machine Translation
- URL: http://arxiv.org/abs/2311.08249v1
- Date: Tue, 14 Nov 2023 15:37:19 GMT
- Title: On Using Distribution-Based Compositionality Assessment to Evaluate
Compositional Generalisation in Machine Translation
- Authors: Anssi Moisio, Mathias Creutz, Mikko Kurimo
- Abstract summary: It is important to develop benchmarks to assess compositional generalisation in real-world natural language tasks.
The benchmark is created by splitting the Europarl translation corpus into a training set and a test set in such a way that the test set requires compositional generalisation capacity.
The procedure is fully automated, making it simple and inexpensive to create natural language compositionality benchmarks for other datasets and languages.
- Score: 10.840893953881652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional generalisation (CG), in NLP and in machine learning more
generally, has been assessed mostly using artificial datasets. It is important
to develop benchmarks to assess CG also in real-world natural language tasks in
order to understand the abilities and limitations of systems deployed in the
wild. To this end, our GenBench Collaborative Benchmarking Task submission
utilises the distribution-based compositionality assessment (DBCA) framework to
split the Europarl translation corpus into a training and a test set in such a
way that the test set requires compositional generalisation capacity.
Specifically, the training and test sets have divergent distributions of
dependency relations, testing NMT systems' capability of translating
dependencies that they have not been trained on. This fully automated
procedure for creating natural language compositionality benchmarks makes it
simple and inexpensive to build similar splits for other datasets and
languages. The code and data for the experiments are available at
https://github.com/aalto-speech/dbca.
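To make the split construction concrete, below is a minimal sketch of a DBCA-style greedy split: it keeps the "atom" distributions (e.g., lemmas) of the two sets similar while driving apart their "compound" distributions (dependency relations), scoring both with the Chernoff coefficient as in the DBCA framework (alpha = 0.5 for atoms, alpha = 0.1 for compounds). The data representation, threshold, and greedy rule are illustrative simplifications, not the actual implementation in the linked repository.

```python
from collections import Counter

def chernoff_divergence(p: Counter, q: Counter, alpha: float) -> float:
    """1 - sum_k p_k^alpha * q_k^(1 - alpha) over the normalised distributions.
    Keys missing from either distribution contribute zero to the sum."""
    p_total, q_total = sum(p.values()), sum(q.values())
    if p_total == 0 or q_total == 0:
        return 0.0
    return 1.0 - sum((p[k] / p_total) ** alpha * (q[k] / q_total) ** (1 - alpha)
                     for k in p.keys() & q.keys())

def greedy_split(sentences, max_atom_div=0.05):
    """Assign each sentence to train or test so that the dependency-relation
    ('compound') distributions diverge while the lemma ('atom') distributions
    stay similar.  Each sentence is a dict with Counters under 'atoms' and
    'compounds'."""
    tr_atoms, tr_comps, train = Counter(), Counter(), []
    te_atoms, te_comps, test = Counter(), Counter(), []
    for s in sentences:
        # Compound divergence if the sentence went to each side.
        comp_if_test = chernoff_divergence(tr_comps, te_comps + s["compounds"], 0.1)
        comp_if_train = chernoff_divergence(tr_comps + s["compounds"], te_comps, 0.1)
        # Atom divergence must stay below the threshold for a test assignment.
        atom_if_test = chernoff_divergence(tr_atoms, te_atoms + s["atoms"], 0.5)
        if comp_if_test >= comp_if_train and atom_if_test <= max_atom_div:
            te_atoms += s["atoms"]; te_comps += s["compounds"]; test.append(s)
        else:
            tr_atoms += s["atoms"]; tr_comps += s["compounds"]; train.append(s)
    return train, test
```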
Related papers
- NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems [2.141587359797428]
It is arduous to compare novel solutions to well-entrenched preprocessing toolkits that rely on rule-based morphological analysers or dictionaries.
Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools.
The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark.
arXiv Detail & Related papers (2024-03-07T14:07:00Z)
- On Evaluating Multilingual Compositional Generalization with Translated Datasets [34.51457321680049]
We show that compositional generalization abilities differ across languages.
We craft a faithful rule-based translation of the MCWQ dataset from English to Chinese and Japanese.
Even with the resulting robust benchmark, which we call MCWQ-R, we show that the distribution of compositions still suffers due to linguistic divergences.
arXiv Detail & Related papers (2023-06-20T10:03:57Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Pre-Training to Learn in Context [138.0745138788142]
The ability to learn in context is not fully exploited because language models are not explicitly trained to do so.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Statistical Machine Translation for Indic Languages [1.8899300124593648]
This paper describes the development of bilingual Statistical Machine Translation models.
To create the systems, the MOSES open-source SMT toolkit is used.
In our experiments, translation quality is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2023-01-02T06:23:12Z)
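As a concrete illustration of the metric-based evaluation the entry above mentions, the snippet below computes corpus-level BLEU and chrF with the sacrebleu library; the sentences are invented, and this is not the paper's actual pipeline (METEOR and RIBES require separate tooling).

```python
import sacrebleu  # pip install sacrebleu

# Invented system outputs and their references (one reference per hypothesis).
hypotheses = ["the committee approved the proposal",
              "members debated the new budget"]
references = [["the committee approved the proposal",
               "the members debated the new budget"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```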
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency semantic region that covers adequate variants of literal expression sharing the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
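The "adjacency region" in the entry above can be pictured as a neighbourhood around a sentence's semantic vector from which augmented representations are sampled. The sketch below draws points uniformly from a ball around a sentence embedding; this isotropic neighbourhood is a deliberate simplification of CsaNMT's learned semantic region, and all names and sizes are illustrative.

```python
import numpy as np

def sample_adjacency_region(embedding: np.ndarray, radius: float, k: int,
                            rng: np.random.Generator) -> np.ndarray:
    """Draw k augmented vectors uniformly from the ball of the given radius
    centred on the sentence embedding (a crude stand-in for a semantic region)."""
    d = embedding.shape[0]
    directions = rng.normal(size=(k, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii distributed as radius * U^(1/d) give a uniform density in the ball.
    radii = radius * rng.random((k, 1)) ** (1.0 / d)
    return embedding + directions * radii

rng = np.random.default_rng(seed=0)
sentence_vec = rng.normal(size=512)  # pretend this came from an NMT encoder
augmented = sample_adjacency_region(sentence_vec, radius=0.1, k=8, rng=rng)
print(augmented.shape)  # (8, 512): eight augmented views of one instance
```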
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
- Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding [101.24748444126982]
Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
arXiv Detail & Related papers (2021-06-29T02:53:59Z)
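A toy rendering of the coordinate-ascent construction in the entry above: treat each test-set slot as a coordinate and greedily swap examples in and out whenever the summed sub-task utilities improve. The utilities and data here are invented; the paper's utility functions are task-specific.

```python
import random

def coordinate_ascent_testset(examples, utilities, test_size, max_rounds=20):
    """Greedy coordinate ascent over test-set membership: accept any single
    swap between the test set and the pool that raises the summed utility."""
    random.seed(0)
    random.shuffle(examples)
    test, pool = examples[:test_size], examples[test_size:]

    def score(candidate):
        return sum(u(candidate) for u in utilities)

    best = score(test)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(test)):
            for j in range(len(pool)):
                test[i], pool[j] = pool[j], test[i]      # try one swap
                s = score(test)
                if s > best:
                    best, improved = s, True             # keep the swap
                else:
                    test[i], pool[j] = pool[j], test[i]  # undo it
        if not improved:
            break
    return test

# Invented sub-task utilities: reward covering distinct intents and slots.
examples = [{"intent": i % 3, "slot": i % 2} for i in range(20)]
utilities = [lambda ts: len({e["intent"] for e in ts}),
             lambda ts: len({e["slot"] for e in ts})]
print(coordinate_ascent_testset(examples, utilities, test_size=4))
```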
- TransQuest: Translation Quality Estimation with Cross-lingual Transformers [14.403165053223395]
We propose a simple QE framework based on cross-lingual transformers.
We use it to implement and evaluate two different neural architectures.
Our evaluation shows that the proposed methods achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-01T16:34:44Z)
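In the spirit of the entry above, the sketch below puts a single regression head on a cross-lingual encoder to score a (source, translation) pair, roughly a MonoTransQuest-style setup. It is a generic reconstruction with Hugging Face transformers, not TransQuest's actual code, and the model would need training on quality-annotated data before its scores mean anything.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class QERegressor(torch.nn.Module):
    """Sentence-level quality estimation: encode the source-translation pair
    with a cross-lingual transformer and regress a single quality score."""
    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **batch):
        hidden = self.encoder(**batch).last_hidden_state[:, 0]  # first token
        return self.head(hidden).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = QERegressor()
batch = tokenizer(["The cat sat on the mat."],
                  ["Le chat était assis sur le tapis."],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    print(model(**batch))  # untrained, so this score is meaningless as-is
```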