Multilingual Stance Detection: The Catalonia Independence Corpus
- URL: http://arxiv.org/abs/2004.00050v1
- Date: Tue, 31 Mar 2020 18:28:36 GMT
- Title: Multilingual Stance Detection: The Catalonia Independence Corpus
- Authors: Elena Zotova, Rodrigo Agerri, Manuel Nuñez, German Rigau
- Abstract summary: Stance detection aims to determine the attitude of a text with respect to a specific topic or claim.
TW-10 Referendum dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish.
This paper presents a new multilingual dataset for stance detection in Twitter for the Catalan and Spanish languages.
- Score: 11.393603788068777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stance detection aims to determine the attitude of a given text with respect
to a specific topic or claim. While stance detection has been fairly well
researched in the last years, most of the work has been focused on English. This
is mainly due to the relative lack of annotated data in other languages. The
TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to
provide multilingual stance-annotated data in Catalan and Spanish.
Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper
addresses these issues by presenting a new multilingual dataset for stance
detection in Twitter for the Catalan and Spanish languages, with the aim of
facilitating research on stance detection in multilingual and cross-lingual
settings. The dataset is annotated with stance towards one topic, namely, the
independence of Catalonia. We also provide a semi-automatic method to annotate
the dataset based on a categorization of Twitter users. We experiment on the
new corpus with a number of supervised approaches, including linear classifiers
and deep learning methods. Comparison of our new corpus with the TW-10
dataset shows both the benefits and potential of a well balanced corpus for
multilingual and cross-lingual research on stance detection. Finally, we
establish new state-of-the-art results on the TW-10 dataset, both for Catalan
and Spanish.
Related papers
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset reaching 0.83 of the Cohen's kappa inter-annotator agreement.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset and we achieve 93.56% of accuracy.
arXiv Detail & Related papers (2022-04-29T07:31:46Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- MLSUM: The Multilingual Summarization Corpus [29.943949944682196]
MLSUM is the first large-scale MultiLingual SUMmarization dataset.
It contains 1.5M+ article/summary pairs in five different languages.
arXiv Detail & Related papers (2020-04-30T15:58:34Z)
- Cross-lingual Emotion Intensity Prediction [13.305282275999778]
We investigate cross-lingual transfer approaches for fine-grained emotion detection in Spanish and Catalan tweets.
We compare six cross-lingual approaches, e.g., machine translation and cross-lingual embeddings, which have varying requirements for parallel data.
The results show that methods with low parallel-data requirements perform surprisingly better than methods that use more parallel data.
arXiv Detail & Related papers (2020-04-08T16:28:16Z)
- X-Stance: A Multilingual Multi-Target Dataset for Stance Detection [42.46681912294797]
We extract a large-scale stance detection dataset from comments written by candidates of elections in Switzerland.
The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection.
arXiv Detail & Related papers (2020-03-18T17:58:10Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)