HLTCOE at TREC 2023 NeuCLIR Track
- URL: http://arxiv.org/abs/2404.08118v1
- Date: Thu, 11 Apr 2024 20:46:18 GMT
- Title: HLTCOE at TREC 2023 NeuCLIR Track
- Authors: Eugene Yang, Dawn Lawrie, James Mayfield
- Abstract summary: The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track.
For PLAID we included a variety of models and training techniques: the English model released with ColBERT v2, translate-train (TT), Translate Distill (TD), and multilingual translate-train (MTT).
- Score: 10.223578525761617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques: the English model released with ColBERT v2, translate-train (TT), Translate Distill (TD), and multilingual translate-train (MTT). TT trains a ColBERT model with English queries and passages automatically translated into the document language from the MS-MARCO v1 collection. This results in three cross-language models for the track, one per language. MTT creates a single model for all three document languages by combining the translations of MS-MARCO passages in all three languages into mixed-language batches. Thus the model learns about matching queries to passages simultaneously in all languages. Distillation uses scores from the mT5 model over non-English translated document pairs to learn how to score query-document pairs. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news tasks as well as the technical documents task.
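The MTT batch mixing described in the abstract can be pictured in a few lines of Python. This is a hypothetical illustration, not the team's actual training pipeline: the file names, row format, and batch size are assumptions.

```python
# Hypothetical sketch of multilingual translate-train (MTT) batch mixing:
# MS-MARCO training triples translated into each NeuCLIR document language
# are pooled and shuffled so every batch interleaves all three languages.
import random
from typing import Iterator

# Assumed file layout: one TSV of translated triples per language.
TRANSLATED_TRIPLES = {
    "zho": "msmarco.zho.triples.tsv",
    "fas": "msmarco.fas.triples.tsv",
    "rus": "msmarco.rus.triples.tsv",
}

def load_triples(path: str) -> list[tuple[str, ...]]:
    """Read (english_query, translated_positive, translated_negative) rows."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f]

def mixed_language_batches(batch_size: int = 32) -> Iterator[list[tuple[str, ...]]]:
    # Pool triples from every language, then shuffle so that each batch
    # mixes languages and the model learns all three simultaneously.
    pool = [t for path in TRANSLATED_TRIPLES.values() for t in load_triples(path)]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i : i + batch_size]
```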
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z) - Distillation for Multilingual Information Retrieval [10.223578525761617]
The Translate-Distill framework trains a cross-language neural dual-encoder model using translation and distillation.
This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for multilingual information retrieval.
We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train by 5% to 25% in nDCG@20 and 15% to 45% in MAP.
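The distillation at the heart of Translate-Distill pairs a teacher reranker with a dual-encoder student. Below is a minimal sketch, assuming a KL-divergence objective between the teacher's and student's score distributions over candidate passages for a query; the exact loss used in this work may differ in detail.

```python
# Minimal sketch of cross-encoder-to-dual-encoder distillation: the student
# (e.g. ColBERT-X) is trained to match the score distribution produced by a
# teacher reranker (e.g. mT5) over a set of passages for the same query.
import torch
import torch.nn.functional as F

def distill_loss(student_scores: torch.Tensor,
                 teacher_scores: torch.Tensor) -> torch.Tensor:
    """Both tensors have shape (batch, n_passages); rows are per-query scores."""
    log_p_student = F.log_softmax(student_scores, dim=-1)
    p_teacher = F.softmax(teacher_scores, dim=-1)
    # KL(teacher || student), averaged over queries in the batch.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```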
arXiv Detail & Related papers (2024-05-02T03:30:03Z) - CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages [0.769672852567215]
CML-TTS is a new Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG)
CML-TTS is based on Multilingual LibriSpeech (MLS) and adapted for training TTS models, consisting of audiobooks in seven languages: Dutch, French, German, Italian, Portuguese, Polish, and Spanish.
We provide YourTTS, a multilingual TTS model trained on 3,176.13 hours from CML-TTS and 245.07 hours of English speech from LibriTTS.
arXiv Detail & Related papers (2023-06-16T17:17:06Z) - Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z) - Multilingual ColBERT-X [11.768656900939048]
ColBERT-X is a dense retrieval model for Cross-Language Information Retrieval (CLIR).
In CLIR, documents are written in one natural language, while the queries are expressed in another.
A related task is multilingual IR (MLIR) where the system creates a single ranked list of documents written in many languages.
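ColBERT-X scores a query against a document with ColBERT's late-interaction ("MaxSim") operator: each query token embedding is matched to its most similar document token embedding, and the maxima are summed. A minimal sketch, assuming L2-normalized embeddings:

```python
# Minimal sketch of ColBERT-style late interaction (MaxSim) scoring, which
# ColBERT-X applies over multilingual token embeddings.
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (n_query_tokens, dim); doc_emb: (n_doc_tokens, dim).

    Assumes L2-normalized rows, so dot products are cosine similarities.
    Returns a scalar relevance score for the query-document pair.
    """
    sim = query_emb @ doc_emb.T               # (n_query_tokens, n_doc_tokens)
    return sim.max(dim=1).values.sum()        # best doc match per query token
```

In MLIR, this scoring is applied identically regardless of a document's language, which is what allows a single model to produce one ranked list over a mixed-language collection.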
arXiv Detail & Related papers (2022-09-03T06:02:52Z) - Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z) - Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task [95.06453182273027]
This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation.
Our model submissions to the shared task were initialized with DeltaLM (https://aka.ms/deltalm), a generic pre-trained multilingual encoder-decoder model.
Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2021-11-03T09:16:17Z) - mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z) - SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English to Chinese, Polish to English, and German to Upper Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)