Related papers: Data Processing Matters: SRPH-Konvergen AI's Machine Translation System for WMT'21

URL: http://arxiv.org/abs/2111.10513v1
Date: Sat, 20 Nov 2021 03:56:37 GMT
Title: Data Processing Matters: SRPH-Konvergen AI's Machine Translation System for WMT'21
Authors: Lintang Sutawika and Jan Christian Blaise Cruz
Abstract summary: We describe the submission of the joint Samsung Research Philippines-Konvergen AI team for the WMT'21 Large Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task, relying mainly on the strength of our data preprocessing techniques to boost performance. Our model ranked first in Indonesian to Javanese, showing that data preprocessing matters equally, if not more, than cutting edge model architectures and training techniques.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In this paper, we describe the submission of the joint Samsung Research Philippines-Konvergen AI team for the WMT'21 Large Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on the strength of our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set, and scored 22.97 average BLEU on the contest's hidden test set, ranking us sixth overall. Despite using only a standard Transformer, our model ranked first in Indonesian to Javanese, showing that data preprocessing matters equally, if not more, than cutting edge model architectures and training techniques.

Related papers

SINAI at eRisk@CLEF 2023: Approaching Early Detection of Gambling with Natural Language Processing [3.987649624343527]
This paper describes the participation of the SINAI team in the eRisk@CLEF lab. One of the proposed tasks has been addressed: Task 2 on the early detection of signs of pathological gambling. The approach presented in Task 2 is based on pre-trained models from Transformers architecture with comprehensive preprocessing data and data balancing techniques.
arXiv Detail & Related papers (2025-09-18T09:50:14Z)
Mavericks at ArAIEval Shared Task: Towards a Safer Digital Space -- Transformer Ensemble Models Tackling Deception and Persuasion [0.0]
We present our approaches for task 1-A and task 2-A of the shared task which focus on persuasion technique detection and disinformation detection respectively. The tasks use multigenre snippets of tweets and news articles for the given binary classification problem. We achieved a micro F1-score of 0.742 on task 1-A (8th rank on the leaderboard) and 0.901 on task 2-A (7th rank on the leaderboard) respectively.
arXiv Detail & Related papers (2023-11-30T17:26:57Z)
BJTU-WeChat's Systems for the WMT22 Chat Translation Task [66.81525961469494]
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German. Based on the Transformer, we apply several effective variants. Our systems achieve 0.810 and 0.946 COMET scores.
arXiv Detail & Related papers (2022-11-28T02:35:04Z)
TSMind: Alibaba and Soochow University's Submission to the WMT22 Translation Suggestion Task [16.986003476984965]
This paper describes the joint submission of Alibaba and Soochow University, TSMind, to the WMT 2022 Shared Task on Translation Suggestion. Basically, we utilize the model paradigm fine-tuning on the downstream tasks based on large-scale pre-trained models. Considering the task's condition of limited use of training data, we follow the data augmentation strategies proposed by WeTS to boost our TS model performance.
arXiv Detail & Related papers (2022-11-16T15:43:31Z)
Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present our submission to the sentence-level MQM benchmark at Quality Estimation Shared Task, named UniTE. Specifically, our systems employ the framework of UniTE, which combined three types of input formats during training with a pre-trained language model. Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z)
Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task. We participate in the general translation task on English↔Livonian. Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task [92.5087402621697]
This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task. The YiTrans system is built on large-scale pre-trained encoder-decoder models. Our final submissions rank first on English-German and English-Chinese end-to-end systems in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2022-06-12T16:13:01Z)
RoBLEURT Submission for the WMT2021 Metrics Task [72.26898579202076]
We present our submission to the Shared Metrics Task: RoBLEURT. Our model reaches state-of-the-art correlations with the WMT 2020 human annotations upon 8 out of 10 to-English language pairs.
arXiv Detail & Related papers (2022-04-28T08:49:40Z)
The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low resource speech translation task. We trained our models with the officially provided ASR and MT datasets. To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)