Multilingual Argument Mining: Datasets and Analysis
- URL: http://arxiv.org/abs/2010.06432v1
- Date: Tue, 13 Oct 2020 14:49:10 GMT
- Title: Multilingual Argument Mining: Datasets and Analysis
- Authors: Orith Toledo-Ronen, Matan Orbach, Yonatan Bilu, Artem Spector, Noam
Slonim
- Abstract summary: We explore the potential of transfer learning using the multilingual BERT model to address argument mining tasks in non-English languages.
We show that such methods are well suited for classifying the stance of arguments and detecting evidence, but less so for assessing the quality of arguments.
We provide a human-generated dataset with more than 10k arguments in multiple languages, as well as machine translation of the English datasets.
- Score: 9.117984896907782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing interest in argument mining and computational argumentation
brings with it a plethora of Natural Language Understanding (NLU) tasks and
corresponding datasets. However, as with many other NLU tasks, the dominant
language is English, with resources in other languages being few and far
between. In this work, we explore the potential of transfer learning using the
multilingual BERT model to address argument mining tasks in non-English
languages, based on English datasets and the use of machine translation. We
show that such methods are well suited for classifying the stance of arguments
and detecting evidence, but less so for assessing the quality of arguments,
presumably because quality is harder to preserve under translation. In
addition, focusing on the translate-train approach, we show how the choice of
languages for translation, and the relations among them, affect the accuracy of
the resultant model. Finally, to facilitate evaluation of transfer learning on
argument mining tasks, we provide a human-generated dataset with more than 10k
arguments in multiple languages, as well as machine translation of the English
datasets.
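A minimal sketch of the translate-train setup described in the abstract, assuming the Hugging Face transformers and datasets libraries; the checkpoint is a standard multilingual BERT, while the machine-translated stance file and its columns are hypothetical placeholders.
```python
# Translate-train sketch: fine-tune multilingual BERT for stance
# classification on English arguments machine-translated into the target
# language. "stance_train.de.csv" (columns: text,label) is a hypothetical
# placeholder for such a translated training file.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # pro/con stance

ds = load_dataset("csv", data_files={"train": "stance_train.de.csv"})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-stance-de",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=ds["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```
The same pipeline is repeated per target language; translate-test would instead translate the evaluation data into English and score it with an English-only model.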
Related papers
- GradSim: Gradient-Based Language Grouping for Effective Multilingual
Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains among the compared grouping approaches.
Besides linguistic features, the topics of the datasets play an important role in language grouping.
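A minimal sketch of grouping languages by gradient similarity as stated; `model`, `loss_fn`, and `loaders` are hypothetical stand-ins, not the GradSim implementation.
```python
# Sketch of gradient-based language grouping: accumulate one gradient
# vector per language, measure pairwise cosine similarity, and cluster
# languages accordingly.
import torch
from sklearn.cluster import AgglomerativeClustering

def language_gradient(model, loss_fn, loader):
    """Accumulate gradients over one pass of a single language's data."""
    model.zero_grad()
    for batch in loader:
        loss_fn(model(batch["inputs"]), batch["labels"]).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()
                      if p.grad is not None])

def group_languages(model, loss_fn, loaders, n_groups=3):
    """loaders: dict mapping language code -> DataLoader for that language."""
    langs = list(loaders)
    G = torch.stack([language_gradient(model, loss_fn, loaders[l])
                     for l in langs])
    sim = torch.nn.functional.cosine_similarity(
        G.unsqueeze(1), G.unsqueeze(0), dim=-1)
    # Cluster on cosine *distance*; similar gradients end up in one group.
    labels = AgglomerativeClustering(
        n_clusters=n_groups, metric="precomputed",
        linkage="average").fit_predict((1.0 - sim).numpy())
    return {lang: int(c) for lang, c in zip(langs, labels)}
```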
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Is Translation Helpful? An Empirical Analysis of Cross-Lingual Transfer in Low-Resource Dialog Generation [21.973937517854935]
Cross-lingual transfer is important for developing high-quality chatbots in multiple languages.
In this work, we investigate whether it is helpful to utilize machine translation (MT) at all in this task.
Experiments show that leveraging English dialog corpora can indeed improve the naturalness, relevance, and cross-domain transferability of Chinese dialog generation.
arXiv Detail & Related papers (2023-05-21T15:07:04Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, the Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT achieves substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
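For illustration, exact-match accuracy in this zero-shot setting reduces to string equality between predicted and gold logical forms; the `parse` callable below is a hypothetical stand-in for the fine-tuned parser.
```python
# Zero-shot evaluation sketch: a parser fine-tuned on English only is
# scored on Italian utterances by exact match of the predicted logical form.
def exact_match_accuracy(parse, examples):
    """examples: list of (utterance, gold_logical_form) pairs."""
    hits = sum(parse(utt).strip() == gold.strip() for utt, gold in examples)
    return hits / len(examples)
```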
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation [4.874780144224057]
Cross-lingual transfer for natural language generation is relatively understudied.
We consider four NLG tasks (text summarization, question generation, news headline generation, and distractor generation) and three syntactically diverse languages.
We propose an unsupervised cross-lingual language generation framework (called ZmBART) that does not use any parallel or pseudo-parallel/back-translated data.
arXiv Detail & Related papers (2021-06-03T05:08:01Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
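A minimal sketch of distilling several language-branch teachers into one multilingual student; the soft-label KL objective and all names here are assumptions for illustration, not the paper's implementation.
```python
# Multilingual distillation sketch: each language-branch teacher supervises
# the single student on its own language's data via a temperature-scaled
# soft-label KL loss.
import torch
import torch.nn.functional as F

def distill_step(student, teachers, batches, T=2.0):
    """teachers/batches: dicts keyed by language; returns the summed KD loss."""
    loss = 0.0
    for lang, batch in batches.items():
        with torch.no_grad():
            t_logits = teachers[lang](batch["inputs"])
        s_logits = student(batch["inputs"])
        loss = loss + F.kl_div(
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean") * (T * T)
    return loss
```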
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems from WMT.
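A usage sketch of direct non-English translation with the released many-to-many model, assuming the Hugging Face transformers library and the facebook/m2m100_418M checkpoint.
```python
# Translate French -> German directly, without pivoting through English.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"
inputs = tokenizer("La vie est belle.", return_tensors="pt")
# Force the decoder to start with the target-language token.
out = model.generate(**inputs,
                     forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```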
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation [14.116412358534442]
Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
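A minimal sketch of subword sampling with the SentencePiece library; the model file name is a placeholder, and the sampling parameters are illustrative defaults rather than the paper's settings.
```python
# Subword sampling: drawing different segmentations of the same sentence
# acts as data augmentation for low-resource translation.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")  # placeholder model
sentence = "transfer learning helps low-resource translation"
for _ in range(3):
    # enable_sampling draws one segmentation from the unigram LM;
    # alpha controls the smoothing of the sampling distribution.
    print(sp.encode(sentence, out_type=str, enable_sampling=True,
                    alpha=0.1, nbest_size=-1))
```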
arXiv Detail & Related papers (2020-04-08T14:19:05Z)