CALCS 2021 Shared Task: Machine Translation for Code-Switched Data
- URL: http://arxiv.org/abs/2202.09625v1
- Date: Sat, 19 Feb 2022 15:39:34 GMT
- Title: CALCS 2021 Shared Task: Machine Translation for Code-Switched Data
- Authors: Shuguang Chen, Gustavo Aguilar, Anirudh Srinivasan, Mona Diab and
Thamar Solorio
- Abstract summary: We address machine translation for code-switched social media data.
We create a community shared task.
For the supervised setting, participants are challenged to translate English into Hindi-English (Eng-Hinglish) in a single direction.
For the unsupervised setting, we provide the following language pairs: English and Spanish-English (Eng-Spanglish), and English and Modern Standard Arabic-Egyptian Arabic (Eng-MSAEA) in both directions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To date, efforts in the code-switching literature have focused for the most
part on language identification, POS, NER, and syntactic parsing. In this
paper, we address machine translation for code-switched social media data. We
create a community shared task. We provide two modalities for participation:
supervised and unsupervised. For the supervised setting, participants are
challenged to translate English into Hindi-English (Eng-Hinglish) in a single
direction. For the unsupervised setting, we provide the following language
pairs: English and Spanish-English (Eng-Spanglish), and English and Modern
Standard Arabic-Egyptian Arabic (Eng-MSAEA) in both directions. We share
insights and challenges in curating the "into" code-switching language
evaluation data. Further, we provide baselines for all language pairs in the
shared task. The leaderboard for the shared task comprises 12 individual system
submissions corresponding to 5 different teams. The best performance achieved
is 12.67% BLEU score for English to Hinglish and 25.72% BLEU score for MSAEA to
English.
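The abstract reports results as BLEU scores. As a rough illustration only (the shared task presumably used a standard scorer, not this code), a minimal corpus-level BLEU with clipped n-gram precisions and a brevity penalty can be sketched in Python; the Hinglish sentence below is invented, not taken from the task data:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions (orders 1..max_n),
    uniform weights, and a brevity penalty for short hypotheses."""
    clipped = [0] * max_n  # matched (clipped) n-grams per order
    totals = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            totals[n - 1] += sum(hc.values())
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
    if 0 in totals or 0 in clipped:
        return 0.0  # no match at some order: BLEU is zero (no smoothing here)
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# Invented Eng->Hinglish example pair, for illustration only.
hyps = ["mujhe yeh movie bahut pasand aayi"]
refs = ["mujhe yeh movie bahut pasand aayi"]
print(f"BLEU = {corpus_bleu(hyps, refs):.2f}")  # identical strings -> 100.00
```

Production evaluations typically rely on a maintained toolkit with standardized tokenization, since tokenization choices alone can shift BLEU by several points.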
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
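The gap between the weighted F1 (0.84) and macro F1 (0.61) above typically means the model does well on frequent labels and poorly on rare ones. A minimal sketch of how the two averages differ (the toy word-level language-ID tags below are invented, not from the CoLI-Kenglish data):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro (unweighted mean over classes) and
    weighted (mean weighted by class support) averages."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[lab] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return per_class, macro, weighted

# Invented word-level language-ID tags for one code-mixed sentence.
gold = ["kn", "kn", "kn", "en", "en", "other"]
pred = ["kn", "kn", "en", "en", "en", "kn"]
_, macro, weighted = f1_scores(gold, pred)
print(f"macro F1 = {macro:.2f}, weighted F1 = {weighted:.2f}")  # 0.49 vs 0.60
```

Here the rare "other" class scores 0, dragging the macro average down while the support-weighted average stays higher, mirroring the 0.84 vs. 0.61 pattern reported above.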
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data [13.187116325089951]
We tackle both directions: English + Hindi to Hinglish, and Hinglish to English.
To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation.
arXiv Detail & Related papers (2022-10-21T19:48:18Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Handshakes AI Research at CASE 2021 Task 1: Exploring different approaches for multilingual tasks [0.22940141855172036]
The aim of the CASE 2021 Shared Task 1 was to detect and classify socio-political and crisis event information in a multilingual setting.
Our submission contained entries in all of the subtasks, and the scores obtained validated our research findings.
arXiv Detail & Related papers (2021-10-29T07:58:49Z)
- WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments [16.938836887702923]
This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European languages task 2020.
The HASOC 2020 organizers provided participants with datasets of code-mixed social media posts in Dravidian languages (Malayalam-English and Tamil-English).
Our system achieved a 0.89 weighted average F1 score on the test set, ranking 5th out of 12 participants.
arXiv Detail & Related papers (2020-11-01T16:52:08Z)
- BRUMS at SemEval-2020 Task 12: Transformer based Multilingual Offensive Language Identification in Social Media [9.710464466895521]
We present a multilingual deep learning model to identify offensive language in social media.
The approach achieves acceptable evaluation scores, while maintaining flexibility between languages.
arXiv Detail & Related papers (2020-10-13T10:39:14Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won the first place on English to Chinese, Polish to English, and German to Upper Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- ANDES at SemEval-2020 Task 12: A jointly-trained BERT multilingual model for offensive language detection [0.6445605125467572]
We jointly-trained a single model by fine-tuning Multilingual BERT to tackle the task across all the proposed languages.
Our single model had competitive results, with a performance close to top-performing systems.
arXiv Detail & Related papers (2020-08-13T16:07:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.