A Parallel Evaluation Data Set of Software Documentation with Document
Structure Annotation
- URL: http://arxiv.org/abs/2008.04550v2
- Date: Thu, 12 Nov 2020 14:15:36 GMT
- Title: A Parallel Evaluation Data Set of Software Documentation with Document
Structure Annotation
- Authors: Bianka Buschbeck and Miriam Exel
- Abstract summary: The data set comprises the language pairs English to Hindi, Indonesian, Malay and Thai.
We provide insights into the origin and creation, the particularities and characteristics of the data set as well as machine translation results.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper accompanies the software documentation data set for machine
translation, a parallel evaluation data set of data originating from the SAP
Help Portal, that we released to the machine translation community for research
purposes. It offers the possibility to tune and evaluate machine translation
systems in the domain of corporate software documentation and contributes to
the availability of a wider range of evaluation scenarios. The data set
comprises the language pairs English to Hindi, Indonesian, Malay and Thai,
and thus also increases the test coverage for the many low-resource language
pairs. Unlike most evaluation data sets that consist of plain parallel text,
the segments in this data set come with additional metadata that describes
structural information of the document context. We provide insights into the
origin and creation, the particularities and characteristics of the data set as
well as machine translation results.
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Dataset of Quotation Attribution in German News Articles [19.222705178881558]
We present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS.
The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens).
arXiv Detail & Related papers (2024-04-25T17:19:13Z)
- Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English-Japanese and English-Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Backretrieval: An Image-Pivoted Evaluation Metric for Cross-Lingual Text Representations Without Parallel Corpora [19.02834713111249]
Backretrieval is shown to correlate with ground truth metrics on annotated datasets.
Our experiments conclude with a case study on a recipe dataset without parallel cross-lingual data.
arXiv Detail & Related papers (2021-05-11T12:14:24Z)
- CDA: a Cost Efficient Content-based Multilingual Web Document Aligner [97.98885151955467]
We introduce a Content-based Document Alignment approach to align multilingual web documents based on content.
We leverage lexical translation models to build vector representations using TF-IDF.
Experiments show that CDA is robust, cost-effective, and is significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.
arXiv Detail & Related papers (2021-02-20T03:37:23Z)
- Context-aware Decoder for Neural Machine Translation using a Target-side Document-Level Language Model [12.543106304662059]
We present a method to turn a sentence-level translation model into a context-aware model by incorporating a document-level language model into the decoder.
Our decoder is built upon only sentence-level parallel corpora and monolingual corpora.
From a theoretical viewpoint, the core part of this work is the novel representation of contextual information using point-wise mutual information between context and the current sentence.
arXiv Detail & Related papers (2020-10-24T08:06:18Z)
- Global Attention for Name Tagging [56.62059996864408]
We present a new framework to improve name tagging by utilizing local, document-level, and corpus-level contextual information.
We propose a model that learns to incorporate document-level and corpus-level contextual information alongside local contextual information via global attentions.
Experiments on benchmark datasets show the effectiveness of our approach.
arXiv Detail & Related papers (2020-10-19T07:27:15Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)