Related papers: BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

Related papers

The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations [0.0]
This paper reports on a large tagged parallel dataset which has been developed to partially address this issue.<n>The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates.<n>The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages.
arXiv Detail & Related papers (2025-05-18T22:13:32Z)
IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages [0.4194295877935868]
We present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset covering nine major Indic languages.<n>IndicSQuAD comprises extensive training, validation, and test sets for each language.<n>We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT.
arXiv Detail & Related papers (2025-05-06T16:42:54Z)
Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review [2.377892000761193]
This review provides a comprehensive overview of available parallel corpora for Indic languages. We critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, and data scarcity. We outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality.
arXiv Detail & Related papers (2025-03-02T21:22:53Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT)<n>We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level.<n>We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields. In this work, we construct an open-source multilingual supervised fine-tuning dataset. The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages. It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages. We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections. We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
Large Scale Multi-Lingual Multi-Modal Summarization Dataset [26.92121230628835]
We present the current largest multi-lingual multi-modal summarization dataset (M3LS) It consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages.
arXiv Detail & Related papers (2023-02-13T18:00:23Z)
MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English only NLU++ dataset to include manual translations into a range of high, medium, and low resource languages. Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals. We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z)
Revamping Multilingual Agreement Bidirectionally via Switched Back-translation for Multilingual Neural Machine Translation [107.83158521848372]
multilingual agreement (MA) has shown its importance for multilingual neural machine translation (MNMT) We present textbfBidirectional textbfMultilingual textbfAgreement via textbfSwitched textbfBack-textbftranslation (textbfBMA-SBT) It is a novel and universal multilingual agreement framework for fine-tuning pre-trained MNMT models.
arXiv Detail & Related papers (2022-09-28T09:14:58Z)
Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages. Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target)
arXiv Detail & Related papers (2021-09-02T12:52:55Z)
Complete Multilingual Neural Machine Translation [44.98358050355681]
We study the use of multi-way aligned examples to enrich the original English-centric parallel corpora. We call MNMT with such connectivity pattern complete Multilingual Neural Machine Translation (cMNMT) In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs.
arXiv Detail & Related papers (2020-10-20T13:03:48Z)
X-SRL: A Parallel Cross-Lingual Semantic Role Labeling Dataset [18.389328059694037]
In this work, we propose a method to automatically construct an SRL corpus that is parallel in four languages: English, French, German, Spanish, with unified predicate and role annotations that are fully comparable across languages. We include human-validated test sets that we use to measure the projection quality, and show that projection is denser and more precise than a strong baseline. Finally, we train different SOTA models on our novel corpus for mono- and multilingual SRL, showing that the multilingual annotations improve performance especially for the weaker languages.
arXiv Detail & Related papers (2020-10-05T13:34:20Z)
A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
multilingual BERT works remarkably well on cross-lingual transfer tasks. Datasize and context window size are crucial factors to the transferability. There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.