A High-Quality Multilingual Dataset for Structured Documentation
  Translation
- URL: http://arxiv.org/abs/2006.13425v1
- Date: Wed, 24 Jun 2020 02:08:44 GMT
- Title: A High-Quality Multilingual Dataset for Structured Documentation
  Translation
- Authors: Kazuma Hashimoto, Raffaella Buschiazzo, James Bradbury, Teresa
  Marshall, Richard Socher, Caiming Xiong
- Abstract summary: This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
- Score: 101.41835967142521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a high-quality multilingual dataset for the documentation
domain to advance research on localization of structured text. Unlike
widely-used datasets for translation of plain text, we collect XML-structured
parallel text segments from the online documentation for an enterprise software
platform. These Web pages have been professionally translated from English into
16 languages and maintained by domain experts, and around 100,000 text segments
are available for each language pair. We build and evaluate translation models
for seven target languages from English, with several different copy mechanisms
and an XML-constrained beam search. We also experiment with a non-English pair
to show that our dataset has the potential to explicitly enable $17 \times 16$
translation settings. Our experiments show that learning to translate with the
XML tags improves translation accuracy, and the beam search accurately
generates XML structures. We also discuss trade-offs of using the copy
mechanisms by focusing on translation of numerical words and named entities. We
further provide a detailed human analysis of gaps between the model output and
human translations for real-world applications, including suitability for
post-editing.
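
The abstract mentions an XML-constrained beam search but does not describe how the constraint works. As a rough illustration only, and not the authors' implementation, the sketch below shows one way a decoder could check whether a candidate next token keeps the output well-formed. The token conventions (one token per tag, an `[EOS]` marker), the function names, and the rule that only tags seen in the source may be opened are all assumptions made for this example.

```python
import re
from typing import List, Optional, Set

# Hypothetical token conventions for this sketch: each XML tag is a single
# token, and EOS marks the end of the output sequence.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)>$")
EOS = "[EOS]"


def open_tag_stack(prefix: List[str]) -> Optional[List[str]]:
    """Return the stack of still-open tags after reading `prefix`,
    or None if the prefix already closes a tag that is not open."""
    stack: List[str] = []
    for token in prefix:
        match = TAG_RE.match(token)
        if match is None:
            continue  # ordinary text token, no effect on the stack
        is_closing, name = match.group(1) == "/", match.group(2)
        if not is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            return None
    return stack


def is_allowed(prefix: List[str], candidate: str, source_tags: Set[str]) -> bool:
    """Decide whether `candidate` may follow `prefix` without breaking
    the XML structure (assumed rules, see the lead-in)."""
    stack = open_tag_stack(prefix)
    if stack is None:
        return False
    if candidate == EOS:
        return not stack  # may only stop once every tag is closed
    match = TAG_RE.match(candidate)
    if match is None:
        return True  # plain text is always permitted
    is_closing, name = match.group(1) == "/", match.group(2)
    if is_closing:
        return bool(stack) and stack[-1] == name  # must close the innermost tag
    return name in source_tags  # assumed rule: only open tags seen in the source
```

In a beam search, a check of this kind would typically be applied to each candidate extension, with disallowed candidates dropped from the beam before scoring continues.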
 
      
Related papers
- ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations [0.0]
 ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts.
The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate.
 arXiv (2025-08-02T15:28:41Z)
- BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation [28.456351723077088]
 BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark.
This dataset is handcrafted in 8 non-English languages.
 arXiv (2025-02-06T18:56:37Z)
- X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and
  Few-shot Agents [43.446606562545085]
 We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages.
X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language.
We develop a toolset to accelerate the post-editing of a new language dataset after translation.
 arXiv (2023-06-30T14:03:30Z)
- Decomposed Prompting for Machine Translation Between Related Languages
  using Large Language Models [55.35106713257871]
 We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations.
We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
 arXiv (2023-05-22T14:52:47Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
 We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
 arXiv (2023-05-15T09:43:32Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
 We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
 arXiv (2022-05-09T00:24:13Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
 We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
 arXiv (2022-02-19T11:55:40Z)
- Scalable Cross-lingual Document Similarity through Language-specific
  Concept Hierarchies [0.0]
 This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of JCR-Acquis corpora reveal promising results on classifying and sorting documents by similar content.
 arXiv (2020-12-15T10:42:40Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low
  Resource and Multilingual MT [0.0]
 This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
 arXiv (2020-10-13T13:12:21Z)
- A Parallel Evaluation Data Set of Software Documentation with Document
  Structure Annotation [0.0]
 The data set comprises the language pairs English to Hindi, Indonesian, Malay and Thai.
We provide insights into the origin and creation, the particularities and characteristics of the data set, as well as machine translation results.
 arXiv (2020-08-11T06:50:23Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
 We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
 arXiv (2020-05-06T04:46:11Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
  Lexical Semantic Similarity [67.36239720463657]
 Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
 arXiv (2020-03-10T17:17:01Z)
       
     