Rethinking Annotation: Can Language Learners Contribute?
- URL: http://arxiv.org/abs/2210.06828v2
- Date: Mon, 29 May 2023 11:39:17 GMT
- Title: Rethinking Annotation: Can Language Learners Contribute?
- Authors: Haneul Yoo, Rifki Afina Putri, Changyoon Lee, Youngin Lee, So-Yeon
Ahn, Dongyeop Kang, Alice Oh
- Abstract summary: In this paper, we investigate whether language learners can contribute annotations to benchmark datasets.
We target three languages, English, Korean, and Indonesian, and the four NLP tasks of sentiment analysis, natural language inference, named entity recognition, and machine reading comprehension.
We find that language learners, especially those with intermediate or advanced levels of language proficiency, are able to provide fairly accurate labels with the help of additional resources.
- Score: 13.882919101548811
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Researchers have traditionally recruited native speakers to provide
annotations for widely used benchmark datasets. However, there are languages
for which recruiting native speakers can be difficult, and it would help to
find learners of those languages to annotate the data. In this paper, we
investigate whether language learners can contribute annotations to benchmark
datasets. In a carefully controlled annotation experiment, we recruit 36
language learners, provide two types of additional resources (dictionaries and
machine-translated sentences), and perform mini-tests to measure their language
proficiency. We target three languages, English, Korean, and Indonesian, and
the four NLP tasks of sentiment analysis, natural language inference, named
entity recognition, and machine reading comprehension. We find that language
learners, especially those with intermediate or advanced levels of language
proficiency, are able to provide fairly accurate labels with the help of
additional resources. Moreover, we show that data annotation improves learners'
language proficiency in terms of vocabulary and grammar. One implication of our
findings is that broadening the annotation task to include language learners
can open up the opportunity to build benchmark datasets for languages for which
it is difficult to recruit native speakers.
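The study's core question is how closely learner-provided labels match trusted annotations. As an illustration only (the metric choice and the labels below are hypothetical, not taken from the paper), agreement between a learner's labels and gold labels could be quantified with simple accuracy and Cohen's kappa:

```python
from collections import Counter

def annotation_quality(gold, learner):
    """Compare a learner's labels against gold labels.

    Returns (accuracy, Cohen's kappa). Kappa corrects raw agreement
    for the agreement expected by chance from label frequencies.
    """
    assert len(gold) == len(learner)
    n = len(gold)
    # Observed agreement: fraction of items where the labels match.
    p_o = sum(g == l for g, l in zip(gold, learner)) / n
    # Chance agreement: product of marginal label frequencies, summed.
    gold_counts = Counter(gold)
    learner_counts = Counter(learner)
    p_e = sum(gold_counts[c] * learner_counts[c] for c in gold_counts) / n**2
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

# Hypothetical sentiment labels: one learner vs. a gold standard.
gold    = ["pos", "neg", "pos", "neu", "neg", "pos"]
learner = ["pos", "neg", "neu", "neu", "neg", "pos"]
acc, kappa = annotation_quality(gold, learner)
```

Kappa is a common choice for this kind of comparison because a learner who guessed the majority label everywhere could still score high raw accuracy; the paper itself may use a different quality measure.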
Related papers
- Large Language Model Augmented Exercise Retrieval for Personalized Language Learning [2.946562343070891]
We find that vector similarity approaches poorly capture the relationship between exercise content and the language that learners use to express what they want to learn.
We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the learner's input.
Our approach, which we call mHyER, overcomes three challenges: (1) lack of relevance labels for training, (2) unrestricted learner input content, and (3) low semantic similarity between input and retrieval candidates.
arXiv Detail & Related papers (2024-02-08T20:35:31Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages [1.7622337807395716]
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks.
Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models.
This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages.
arXiv Detail & Related papers (2023-11-09T05:46:41Z)
- Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning [66.79173000135717]
We apply this work to teaching two Indian languages, Kannada and Marathi, which do not have well-developed resources for second language learning.
We extract descriptions from a natural text corpus that answer questions about morphosyntax (learning of word order, agreement, case marking, or word formation) and semantics (learning of vocabulary).
We enlist language educators from schools in North America to perform a manual evaluation; they find the materials have potential to be used for lesson preparation and learner evaluation.
arXiv Detail & Related papers (2023-10-27T18:17:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework designed especially for low-resource languages.
Because low-resource languages lack sufficient video-text paired data for training, developing lip reading models for them is considered challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.