MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel
Corpora
- URL: http://arxiv.org/abs/2005.10583v1
- Date: Thu, 21 May 2020 11:46:44 GMT
- Title: MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel
Corpora
- Authors: Lifeng Han, Gareth J.F. Jones and Alan F. Smeaton
- Abstract summary: Multi-word expressions (MWEs) are a hot research topic in natural language processing (NLP).
The availability of bilingual or multi-lingual MWE corpora is very limited.
We present a collection of 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering.
- Score: 14.105783620789667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-word expressions (MWEs) are a hot research topic in natural
language processing (NLP), spanning MWE detection, MWE decomposition, and
research investigating the exploitation of MWEs in other NLP fields such as
Machine Translation. However, the availability of bilingual or multi-lingual
MWE corpora is very limited. The only bilingual MWE corpus that we are aware
of is from the PARSEME (PARSing and Multi-word Expressions) EU Project. This is
a small collection of only 871 pairs of English-German MWEs. In this paper, we
present multi-lingual and bilingual MWE corpora that we have extracted from
root parallel corpora. Our collections are 3,159,226 and 143,042 bilingual MWE
pairs for German-English and Chinese-English respectively after filtering. We
examine the quality of these extracted bilingual MWEs in MT experiments. Our
initial experiments applying MWEs in MT show improved translation performance
on MWE terms in qualitative analysis and better general evaluation scores in
quantitative analysis, on both German-English and Chinese-English language
pairs. We follow a standard experimental pipeline to create our MultiMWE
corpora, which are available online. Researchers can use these free corpora to
train their own models or use them in a knowledge base as model features.
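The paper's pipeline extracts bilingual MWE pairs from word-aligned parallel corpora and then filters them. As a rough illustration of that idea, the sketch below collects contiguous aligned phrase pairs and keeps only those seen above a frequency threshold. The alignment format, the bigram-only candidates, and the count-based filter are simplifying assumptions for illustration, not the exact MultiMWE procedure.

```python
from collections import Counter

def extract_mwe_pairs(parallel_corpus, min_count=2):
    """Toy bilingual MWE-pair extraction from a word-aligned corpus.

    parallel_corpus: iterable of (src_tokens, tgt_tokens, alignments),
    where alignments is a list of (src_index, tgt_index) links.
    """
    counts = Counter()
    for src, tgt, links in parallel_corpus:
        # Consider source bigrams (a toy stand-in for MWE candidates)
        for i in range(len(src) - 1):
            # Target positions aligned to either source token
            tgt_idx = sorted(j for (s, j) in links if s in (i, i + 1))
            # Keep only compact target spans as candidate counterparts
            if tgt_idx and tgt_idx[-1] - tgt_idx[0] <= 2:
                src_phrase = " ".join(src[i:i + 2])
                tgt_phrase = " ".join(tgt[tgt_idx[0]:tgt_idx[-1] + 1])
                counts[(src_phrase, tgt_phrase)] += 1
    # Frequency filter: drop pairs seen fewer than min_count times
    return {pair: c for pair, c in counts.items() if c >= min_count}
```

A real pipeline would use an MWE detector and a statistical aligner rather than raw bigrams, but the extract-then-filter shape is the same.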
Related papers
- On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations? [19.346078451375693]
We present an analysis of existing evaluation frameworks in NLP.
We propose several directions for more robust and reliable evaluation practices.
We show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
arXiv Detail & Related papers (2024-06-20T12:46:12Z) - Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language [1.1702440973773898]
This study explores the use of large language models for translating English into Mambai, a low-resource Austronesian language spoken in Timor-Leste.
Our methodology involves the strategic selection of parallel sentences and dictionary entries for prompting.
We find that including dictionary entries in prompts and a mix of sentences retrieved via TF-IDF and semantic embeddings significantly improves translation quality.
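The TF-IDF half of that retrieval step can be sketched with plain Python: score candidate parallel sentences against the input by TF-IDF cosine similarity and keep the top-k as in-context examples. Function names and the scoring details are illustrative assumptions, not code from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {token: tf-idf weight} dicts."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n / df[tok]) + 1.0 for tok in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({tok: tf[tok] * idf[tok] for tok in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query, pool, k=2):
    """pool: list of (src_tokens, tgt_sentence) parallel pairs.

    Returns the k pairs whose source side is most similar to the query.
    """
    vecs = tfidf_vectors([src for src, _ in pool] + [query])
    qvec = vecs[-1]
    scored = sorted(zip(pool, vecs[:-1]),
                    key=lambda p: cosine(qvec, p[1]), reverse=True)
    return [pair for pair, _ in scored[:k]]
```

The retrieved pairs would then be placed in the prompt alongside dictionary entries; the paper additionally mixes in sentences retrieved by semantic embeddings, which this sketch omits.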
arXiv Detail & Related papers (2024-04-07T05:04:38Z) - Towards Building Multilingual Language Model for Medicine [54.1382395897071]
We construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages.
We propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench.
Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks.
arXiv Detail & Related papers (2024-02-21T17:47:20Z) - LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent approaches still train a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z) - High-resource Language-specific Training for Multilingual Neural Machine
Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z) - Towards the Next 1000 Languages in Multilingual Machine Translation:
Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
arXiv Detail & Related papers (2022-01-09T23:36:44Z) - Efficient Inference for Multilingual Neural Machine Translation [60.10996883354372]
We consider several ways to make multilingual NMT faster at inference without degrading its quality.
Our experiments demonstrate that combining a shallow decoder with vocabulary filtering leads to more than twice faster inference with no loss in translation quality.
arXiv Detail & Related papers (2021-09-14T13:28:13Z) - AlphaMWE: Construction of Multilingual Parallel Corpora with MWE
Annotations [5.8010446129208155]
We present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs).
The languages covered include English, Chinese, Polish, and German.
We present a categorisation of the error types encountered by MT systems in performing MWE related translation.
arXiv Detail & Related papers (2020-11-07T14:28:54Z) - Knowledge Distillation for Multilingual Unsupervised Neural Machine
Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.