Benchmarking Procedural Language Understanding for Low-Resource
Languages: A Case Study on Turkish
- URL: http://arxiv.org/abs/2309.06698v2
- Date: Wed, 6 Mar 2024 20:05:37 GMT
- Title: Benchmarking Procedural Language Understanding for Low-Resource
Languages: A Case Study on Turkish
- Authors: Arda Uzunoglu and Gözde Gül Şahin
- Abstract summary: We conduct a case study on Turkish procedural texts.
We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools.
We generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization.
- Score: 2.396465363376008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding procedural natural language (e.g., step-by-step instructions)
is a crucial step to execution and planning. However, while there are ample
corpora and downstream tasks available in English, the field lacks such
resources for most languages. To address this gap, we conduct a case study on
Turkish procedural texts. We first expand the number of tutorials in Turkish
wikiHow from 2,000 to 52,000 using automated translation tools, where the
translation quality and fidelity to the original meaning are validated by a team
of experts on a random set. Then, we generate several downstream tasks on the
corpus, such as linking actions, goal inference, and summarization. To tackle
these tasks, we implement strong baseline models via fine-tuning large
language-specific models such as TR-BART and BERTurk, as well as multilingual
models such as mBART, mT5, and XLM. We find that language-specific models
consistently outperform their multilingual counterparts by a significant margin
across most procedural language understanding (PLU) tasks. We release our
corpus, downstream tasks, and baseline models at
https://github.com/GGLAB-KU/turkish-plu.
Related papers
- PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning [46.153828074152436]
We propose a pivot language guided generation approach to enhance instruction tuning in lower-resource languages.
It trains the model to first process instructions in the pivot language, and then produce responses in the target language.
Our approach demonstrates a significant improvement in the instruction-following abilities of LLMs by 29% on average.
arXiv Detail & Related papers (2023-11-15T05:28:07Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Comparison of Pre-trained Language Models for Turkish Address Parsing [0.0]
We focus on Turkish maps data and thoroughly evaluate both multilingual and Turkish-based BERT, DistilBERT, ELECTRA and RoBERTa.
We also propose a MultiLayer Perceptron (MLP) for fine-tuning BERT in addition to the standard approach of one-layer fine-tuning.
arXiv Detail & Related papers (2023-06-24T12:09:43Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been made to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Goal-Oriented Script Construction [23.6227797113877]
We propose the Goal-Oriented Script Construction task, where a model produces a sequence of steps to accomplish a given goal.
We pilot our task on the first multilingual script learning dataset supporting 18 languages collected from wikiHow.
arXiv Detail & Related papers (2021-07-28T06:39:31Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We also propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
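Several entries above rely on code-switching data (the CoSDA-ML augmentation and the code-switching restore task). The following is a minimal sketch of what word-level code-switching augmentation might look like, not the method of any listed paper. It assumes tokens are replaced using small bilingual dictionaries; the function `code_switch`, the `ratio` parameter, and the toy dictionaries are hypothetical and chosen only for illustration.

```python
import random


def code_switch(tokens, dictionaries, ratio=0.5, seed=None):
    """Return a copy of `tokens` where some words are swapped for translations.

    `dictionaries` maps a language code to a {source_word: [translations]} dict.
    Each token that has at least one translation is replaced with probability
    `ratio`, using a randomly chosen language and translation.
    """
    rng = random.Random(seed)
    switched = []
    for token in tokens:
        key = token.lower()
        # Languages whose (toy) dictionary contains this token.
        candidates = [lang for lang, entries in dictionaries.items() if key in entries]
        if candidates and rng.random() < ratio:
            lang = rng.choice(candidates)
            switched.append(rng.choice(dictionaries[lang][key]))
        else:
            # Keep the original token when no translation exists or it is not sampled.
            switched.append(token)
    return switched


# Hypothetical toy dictionaries (English -> Turkish / German), for illustration only.
TOY_DICTIONARIES = {
    "tr": {"water": ["su"], "boil": ["kaynat"]},
    "de": {"water": ["Wasser"], "boil": ["kochen"]},
}

sentence = "boil the water".split()
augmented = code_switch(sentence, TOY_DICTIONARIES, ratio=1.0, seed=0)
print(" ".join(augmented))
```

With `ratio=1.0` every token that appears in a dictionary is replaced, while words without an entry (here, "the") pass through unchanged. A lower ratio would leave more of the source sentence intact.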
This list is automatically generated from the titles and abstracts of the papers in this site.