Rethinking Annotation: Can Language Learners Contribute?
- URL: http://arxiv.org/abs/2210.06828v2
- Date: Mon, 29 May 2023 11:39:17 GMT
- Title: Rethinking Annotation: Can Language Learners Contribute?
- Authors: Haneul Yoo, Rifki Afina Putri, Changyoon Lee, Youngin Lee, So-Yeon
Ahn, Dongyeop Kang, Alice Oh
- Abstract summary: In this paper, we investigate whether language learners can contribute annotations to benchmark datasets.
We target three languages, English, Korean, and Indonesian, and the four NLP tasks of sentiment analysis, natural language inference, named entity recognition, and machine reading comprehension.
We find that language learners, especially those with intermediate or advanced levels of language proficiency, are able to provide fairly accurate labels with the help of additional resources.
- Score: 13.882919101548811
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Researchers have traditionally recruited native speakers to provide
annotations for widely used benchmark datasets. However, there are languages
for which recruiting native speakers can be difficult, and it would help to
find learners of those languages to annotate the data. In this paper, we
investigate whether language learners can contribute annotations to benchmark
datasets. In a carefully controlled annotation experiment, we recruit 36
language learners, provide two types of additional resources (dictionaries and
machine-translated sentences), and perform mini-tests to measure their language
proficiency. We target three languages, English, Korean, and Indonesian, and
the four NLP tasks of sentiment analysis, natural language inference, named
entity recognition, and machine reading comprehension. We find that language
learners, especially those with intermediate or advanced levels of language
proficiency, are able to provide fairly accurate labels with the help of
additional resources. Moreover, we show that data annotation improves learners'
language proficiency in terms of vocabulary and grammar. One implication of our
findings is that broadening the annotation task to include language learners
can open up the opportunity to build benchmark datasets for languages for which
it is difficult to recruit native speakers.
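The study's core question is how closely learner-provided labels match trusted annotations. As an illustration only (the metric choice and the labels below are hypothetical, not taken from the paper), agreement between a learner's labels and gold labels could be quantified with simple accuracy and Cohen's kappa:

```python
from collections import Counter

def annotation_quality(gold, learner):
    """Compare a learner's labels against gold labels.

    Returns (accuracy, Cohen's kappa). Kappa corrects raw agreement
    for the agreement expected by chance from label frequencies.
    """
    assert len(gold) == len(learner)
    n = len(gold)
    # Observed agreement: fraction of items where the labels match.
    p_o = sum(g == l for g, l in zip(gold, learner)) / n
    # Chance agreement: product of marginal label frequencies, summed.
    gold_counts = Counter(gold)
    learner_counts = Counter(learner)
    p_e = sum(gold_counts[c] * learner_counts[c] for c in gold_counts) / n**2
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

# Hypothetical sentiment labels: one learner vs. a gold standard.
gold    = ["pos", "neg", "pos", "neu", "neg", "pos"]
learner = ["pos", "neg", "neu", "neu", "neg", "pos"]
acc, kappa = annotation_quality(gold, learner)
```

Kappa is a common choice for this kind of comparison because a learner who guessed the majority label everywhere could still score high raw accuracy; the paper itself may use a different quality measure.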
Related papers
- Large Language Model Augmented Exercise Retrieval for Personalized Language Learning [2.946562343070891]
We find that vector similarity approaches poorly capture the relationship between exercise content and the language that learners use to express what they want to learn.
We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the learner's input.
Our approach, which we call mHyER, overcomes three challenges: (1) lack of relevance labels for training, (2) unrestricted learner input content, and (3) low semantic similarity between input and retrieval candidates.
arXiv Detail & Related papers (2024-02-08T20:35:31Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages [1.7622337807395716]
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks.
Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models.
This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages.
arXiv Detail & Related papers (2023-11-09T05:46:41Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning [66.79173000135717]
We apply this work to teaching two Indian languages, Kannada and Marathi, which do not have well-developed resources for second language learning.
We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary).
We enlist language educators from schools in North America to perform a manual evaluation; they find the materials have potential to be used for lesson preparation and learner evaluation.
arXiv Detail & Related papers (2023-10-27T18:17:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework designed especially for low-resource languages.
Because low-resource languages lack sufficient video-text paired data for training, developing lip reading models for them is considered challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.