Gamified Crowdsourcing for Idiom Corpora Construction
- URL: http://arxiv.org/abs/2102.00881v1
- Date: Mon, 1 Feb 2021 14:44:43 GMT
- Title: Gamified Crowdsourcing for Idiom Corpora Construction
- Authors: G\"ul\c{s}en Eryi\u{g}it, Ali \c{S}enta\c{s}, Johanna Monti
- Abstract summary: This article introduces a gamified crowdsourcing approach for collecting language learning materials for idiomatic expressions.
A messaging bot is designed as an asynchronous multiplayer game for native speakers who compete with each other.
The approach has been shown to have the potential to speed up the construction of idiom corpora for different natural languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning idiomatic expressions is seen as one of the most challenging stages
in second language learning because of their unpredictable meaning. A similar
situation holds for their identification within natural language processing
applications such as machine translation and parsing. The lack of high-quality
usage samples exacerbates this challenge not only for humans but also for
artificial intelligence systems. This article introduces a gamified
crowdsourcing approach for collecting language learning materials for idiomatic
expressions; a messaging bot is designed as an asynchronous multiplayer game
for native speakers who compete with each other while providing idiomatic and
nonidiomatic usage examples and rating other players' entries. As opposed to
classical crowdprocessing annotation efforts in the field, for the first time
in the literature, a crowdcreating & crowdrating approach is implemented and
tested for idiom corpora construction. The approach is language independent and
evaluated on two languages in comparison to traditional data preparation
techniques in the field. The reaction of the crowd is monitored under different
motivational means (namely, gamification affordances and monetary rewards). The
results reveal that the proposed approach is powerful in collecting the
targeted materials and that, although it is an explicit crowdsourcing approach,
the crowd finds it entertaining and useful. The approach has been shown to
have the potential to speed up the construction of idiom corpora for different
natural languages to be used as second language learning material, training
data for supervised idiom identification systems, or samples for lexicographic
studies.
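To make the game mechanics concrete, here is a minimal Python sketch of the crowdcreate & crowdrate loop the abstract describes; all names (Entry, IdiomGame, submit, rate) and the point values are illustrative assumptions, not the authors' actual messaging-bot implementation.
```python
# Minimal sketch of the crowdcreate & crowdrate game loop; names and
# point values are hypothetical, not the authors' actual bot.
from dataclasses import dataclass, field

@dataclass
class Entry:
    player: str
    idiom: str
    sentence: str
    idiomatic: bool                      # idiomatic vs. non-idiomatic usage example
    ratings: list[int] = field(default_factory=list)

class IdiomGame:
    """Asynchronous multiplayer game: native speakers submit usage
    examples for idioms (crowdcreate) and rate other players' entries
    (crowdrate), competing on a points leaderboard."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []
        self.scores: dict[str, int] = {}

    def submit(self, player: str, idiom: str, sentence: str, idiomatic: bool) -> Entry:
        entry = Entry(player, idiom, sentence, idiomatic)
        self.entries.append(entry)
        self.scores[player] = self.scores.get(player, 0) + 2   # reward for creating
        return entry

    def rate(self, rater: str, entry: Entry, stars: int) -> None:
        if rater == entry.player:
            return                                             # no self-rating
        entry.ratings.append(stars)
        self.scores[rater] = self.scores.get(rater, 0) + 1     # reward for rating
        # well-rated entries feed points back to their author (gamification)
        if sum(entry.ratings) / len(entry.ratings) >= 4:
            self.scores[entry.player] = self.scores.get(entry.player, 0) + 1

    def leaderboard(self) -> list[tuple[str, int]]:
        return sorted(self.scores.items(), key=lambda kv: -kv[1])
```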
Related papers
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models trained on more data outperform monolingual ones but that, for a fixed amount of data, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - Learning an Artificial Language for Knowledge-Sharing in Multilingual
Translation [15.32063273544696]
We discretize the latent space of multilingual models by assigning encoder states to entries in a codebook.
We validate our approach on large-scale experiments with realistic data volumes and domains.
We also use the learned artificial language to analyze model behavior, and discover that using a similar bridge language increases knowledge-sharing among the remaining languages.
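As a rough illustration of the discretization step described above, here is a hedged PyTorch sketch that assigns encoder states to their nearest codebook entries; the tensor sizes and the straight-through gradient trick are assumptions, not necessarily the paper's exact setup.
```python
# Sketch of discretizing encoder states with a codebook (vector
# quantization); sizes and the straight-through estimator are assumptions.
import torch

def quantize(states: torch.Tensor, codebook: torch.Tensor):
    """states: (batch, time, d); codebook: (K, d).
    Assigns each encoder state to its nearest codebook entry."""
    dists = torch.cdist(states, codebook.unsqueeze(0).expand(states.size(0), -1, -1))
    ids = dists.argmin(dim=-1)           # (batch, time) discrete "artificial language"
    quantized = codebook[ids]            # (batch, time, d)
    # straight-through estimator so gradients still flow to the encoder
    quantized = states + (quantized - states).detach()
    return quantized, ids

states = torch.randn(2, 5, 16)           # toy encoder states
codebook = torch.randn(64, 16)           # toy codebook of 64 entries
q, ids = quantize(states, codebook)
```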
arXiv Detail & Related papers (2022-11-02T17:14:42Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - A simple language-agnostic yet very strong baseline system for hate
speech and offensive content identification [0.0]
A system based on a classical supervised algorithm only fed with character n-grams, and thus completely language-agnostic, is proposed.
It reached a medium performance level in English, the language for which it is easiest to develop deep learning approaches, and it even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches.
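A minimal sketch of such a character n-gram baseline, assuming a TF-IDF representation and a linear SVM; the exact n-gram range and classifier in the paper may differ.
```python
# Hedged sketch of a language-agnostic character n-gram baseline;
# the n-gram range and the LinearSVC classifier are assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5)),  # character n-grams only
    LinearSVC(),                                              # classical supervised classifier
)

# Toy training data; real shared-task data would replace this.
texts = ["example post one", "another example post"]
labels = ["offensive", "not_offensive"]
model.fit(texts, labels)
print(model.predict(["a new post to classify"]))
```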
arXiv Detail & Related papers (2022-02-05T08:09:09Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
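A minimal sketch of the teacher-student transfer named in the title, using a standard knowledge-distillation loss; the temperature and loss form are assumptions, not necessarily the paper's recipe.
```python
# Standard knowledge-distillation loss: the student's intent distribution
# is pulled toward the pre-trained teacher's softened distribution.
# The temperature T=2.0 is an assumed, illustrative value.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # scale by T^2 to keep gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```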
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - It's All in the Heads: Using Attention Heads as a Baseline for
Cross-Lingual Transfer in Commonsense Reasoning [4.200736775540874]
We design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features.
The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning.
Most of the performance comes from the same small subset of attention heads across all studied languages.
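A hedged sketch of the idea described above: pool per-head attention statistics from a multilingual encoder and fit a linear classifier on them. The pooling scheme and the choice of xlm-roberta-base are illustrative assumptions, not the paper's exact feature extraction.
```python
# Sketch: use attention-head statistics as features for a linear classifier.
# The mean-pooling of each head's attention map is an assumed simplification.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_attentions=True)

def head_features(sentence: str):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions: one (1, heads, seq, seq) tensor per layer;
    # reduce each head's attention map to a single scalar feature
    feats = [att.mean(dim=(-1, -2)).squeeze(0) for att in out.attentions]
    return torch.cat(feats).numpy()      # shape: (layers * heads,)

# Toy labels; real commonsense-reasoning data would replace this.
X = [head_features(s) for s in ["an example", "another example"]]
clf = LogisticRegression().fit(X, [0, 1])
```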
arXiv Detail & Related papers (2021-06-22T21:25:43Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
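A toy sketch of the token-to-image ("voken") retrieval step described above, assuming token and image embeddings live in a shared space and relatedness is cosine similarity; the actual vokenizer is a trained cross-modal matching model, so this is illustrative only.
```python
# Toy voken retrieval: each contextual token embedding picks its most
# related image by cosine similarity. Shared embedding space is assumed.
import torch
import torch.nn.functional as F

def vokenize(token_embs: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """token_embs: (seq, d) contextual token vectors;
    image_embs: (n_images, d) image vectors.
    Returns one image id ("voken") per token."""
    sims = F.normalize(token_embs, dim=-1) @ F.normalize(image_embs, dim=-1).T
    return sims.argmax(dim=-1)           # (seq,) voken ids

vokens = vokenize(torch.randn(7, 512), torch.randn(1000, 512))
```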
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.