Abstract: Popular social media networks provide the perfect environment to study the
opinions and attitudes expressed by users. While interactions on social media
platforms such as Twitter occur in many natural languages, research on stance detection
(the position or attitude expressed with respect to a specific topic) within
the Natural Language Processing field has largely been done for English.
Although some efforts have recently been made to develop annotated data in
other languages, there is a telling lack of resources to facilitate
multilingual and crosslingual research on stance detection. This is partially
due to the fact that manually annotating a corpus of social media texts is a
difficult, slow and costly process. Furthermore, as stance is a highly domain-
and topic-specific phenomenon, the need for annotated data is especially
acute. As a result, most of the manually labeled resources are hindered by
their relatively small size and skewed class distribution. This paper presents
a method to obtain multilingual datasets for stance detection on Twitter.
Instead of manually annotating on a per-tweet basis, we leverage user-based
information to semi-automatically label large amounts of tweets. Empirical
monolingual and cross-lingual experimentation and qualitative analysis show
that our method helps overcome the aforementioned difficulties in building
large, balanced and multilingual labeled corpora. We believe that our method
can be easily adapted to generate labeled social media data for other
Natural Language Processing tasks and domains.