Contextualising Levels of Language Resourcedness affecting Digital
Processing of Text
- URL: http://arxiv.org/abs/2309.17035v1
- Date: Fri, 29 Sep 2023 07:48:24 GMT
- Title: Contextualising Levels of Language Resourcedness affecting Digital
Processing of Text
- Authors: C. Maria Keet and Langa Khumalo
- Abstract summary: We argue that the dichotomous typology LRL and HRL for all languages is problematic.
The characterization is based on the typology of contextual features for each category, rather than counting tools.
- Score: 0.5620321106679633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Application domains such as digital humanities and tools like chatbots involve
some form of processing natural language, from digitising hardcopies to speech
generation. The language of the content is typically characterised as either a
low resource language (LRL) or high resource language (HRL), also known as
resource-scarce and well-resourced languages, respectively. African languages
have been characterized as resource-scarce languages (Bosch et al. 2007;
Pretorius & Bosch 2003; Keet & Khumalo 2014) and English is by far the most
well-resourced language. Varied language resources are used to develop software
systems for these languages to accomplish a wide range of tasks. In this paper
we argue that the dichotomous typology LRL and HRL for all languages is
problematic. Through a clear understanding of language resources situated in a
society, a matrix is developed that characterizes languages as Very LRL, LRL,
RL, HRL and Very HRL. The characterization is based on the typology of
contextual features for each category, rather than counting tools, and
motivation is provided for each feature and each characterization. The
contextualisation of resourcedness, with a focus on African languages in this
paper, and an increased understanding of where on the scale the language used
in a project is, may assist in, among others, better planning of research and
implementation projects. We thus argue in this paper that the characterization
of language resources within a given scale in a project is an indispensable
component particularly in the context of low-resourced languages.
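
As a rough illustration of the five-point scale argued for in the abstract, the sketch below encodes the categories Very LRL, LRL, RL, HRL and Very HRL and assigns a language to one of them from a handful of contextual features. The feature names and thresholds are hypothetical placeholders standing in for the paper's matrix, which defines and motivates its own contextual features per category; the only point carried over is that the characterisation is feature-based rather than tool-count based.

```python
from dataclasses import dataclass
from enum import Enum


class Resourcedness(Enum):
    """The five-point scale named in the abstract; criteria below are illustrative only."""
    VERY_LRL = "Very low resource language"
    LRL = "Low resource language"
    RL = "Resourced language"
    HRL = "High resource language"
    VERY_HRL = "Very high resource language"


@dataclass
class ContextualFeatures:
    """Hypothetical contextual features; stand-ins for the paper's actual matrix."""
    standardised_orthography: bool
    curated_text_corpus: bool
    spellchecker: bool
    morphological_analyser_or_tagger: bool
    substantial_web_presence: bool
    commercial_nlp_support: bool  # e.g. MT, speech, LLM coverage


def characterise(f: ContextualFeatures) -> Resourcedness:
    """Map contextual features to a level on the scale (illustrative thresholds only)."""
    if not f.standardised_orthography:
        return Resourcedness.VERY_LRL
    if not (f.curated_text_corpus and f.spellchecker):
        return Resourcedness.LRL
    if not (f.morphological_analyser_or_tagger and f.substantial_web_presence):
        return Resourcedness.RL
    if not f.commercial_nlp_support:
        return Resourcedness.HRL
    return Resourcedness.VERY_HRL


if __name__ == "__main__":
    # Feature values invented for the example; they make no claim about any real language.
    example = ContextualFeatures(True, True, True, True, False, False)
    print(characterise(example).value)  # "Resourced language" under these made-up thresholds
```

The sketch deliberately omits tool counts: moving up the scale depends on which kinds of resources exist in the language's societal context, which is the distinction the paper draws.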
Related papers
- LLMs Are Few-Shot In-Context Low-Resource Language Learners [59.74451570590808]
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages.
We extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.
The study concludes that few-shot in-context information significantly enhances LLMs' understanding of low-resource languages (a minimal prompt-assembly sketch in this spirit appears after the list of related papers).
arXiv Detail & Related papers (2024-03-25T07:55:29Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Although large language models (LLMs) have been pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Learning Transfers over Several Programming Languages [5.350495525141013]
Cross-lingual transfer uses data from a source language to improve model performance on a target language.
This paper reports extensive experiments on four tasks using a transformer-based large language model and 11 to 41 programming languages.
We find that learning transfers well across several programming languages.
arXiv Detail & Related papers (2023-10-25T19:04:33Z)
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework, especially for low-resource languages.
Because low-resource languages do not have enough paired video-text data for training, developing lip reading models for them is regarded as challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z)
- Overcoming Language Disparity in Online Content Classification with Multimodal Learning [22.73281502531998]
Large language models are now the standard for developing state-of-the-art solutions to text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
arXiv Detail & Related papers (2022-05-19T17:56:02Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
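
Of the related papers above, the few-shot in-context learning entry describes a technique that can be illustrated without its data. The sketch below assembles a cross-lingual few-shot prompt in the spirit of X-ICL: labelled exemplars in a higher-resource language are prepended to a query in a low-resource target language, and an LLM is expected to continue the pattern. The task, template, sentences, and labels are invented placeholders and do not reproduce that paper's prompts or experimental setup.

```python
from typing import List, Tuple


def build_xicl_prompt(
    source_exemplars: List[Tuple[str, str]],  # (sentence in a higher-resource language, label)
    target_query: str,                        # sentence in the low-resource target language
    instruction: str = "Label the sentiment of each sentence as positive or negative.",
) -> str:
    """Assemble a cross-lingual few-shot prompt; the model continues the pattern for the query."""
    lines = [instruction, ""]
    for sentence, label in source_exemplars:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Sentence: {target_query}")
    lines.append("Label:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy English exemplars and a toy isiZulu query ("I like this book.").
    exemplars = [
        ("The service was excellent and the staff were friendly.", "positive"),
        ("The product broke after one day.", "negative"),
    ]
    print(build_xicl_prompt(exemplars, "Ngiyayithanda le ncwadi."))
    # The resulting string would be sent to an LLM; no model call is made here.
```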