Considerations for Multilingual Wikipedia Research
- URL: http://arxiv.org/abs/2204.02483v1
- Date: Tue, 5 Apr 2022 20:34:15 GMT
- Title: Considerations for Multilingual Wikipedia Research
- Authors: Isaac Johnson and Emily Lescak
- Abstract summary: The growth of non-English language editions of Wikipedia has led to the inclusion of many more language editions in datasets and models.
This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia.
- Score: 1.5736899098702972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: English Wikipedia has long been an important data source for much research
and natural language machine learning modeling. The growth of non-English
language editions of Wikipedia, greater computational resources, and calls for
equity in the performance of language and multimodal models have led to the
inclusion of many more language editions of Wikipedia in datasets and models.
Building better multilingual and multimodal models requires more than just
access to expanded datasets; it also requires a better understanding of what is
in the data and how this content was generated. This paper seeks to provide
some background to help researchers think about what differences might arise
between different language editions of Wikipedia and how that might affect
their models. It details three major ways in which content differences between
language editions arise (local context, community and governance, and
technology) and recommendations for good practices when using multilingual and
multimodal data for research and modeling.
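As an illustration of the kind of cross-edition differences the paper discusses, the sketch below is a minimal example, not from the paper: it uses the standard public MediaWiki API (action=query with prop=langlinks and prop=info) to list which language editions link to a given English article and how long each local version is. The example title and the ten-edition cap are arbitrary choices.
```python
# A minimal sketch, not from the paper: compare coverage and raw article
# size for one topic across Wikipedia language editions.
import requests

HEADERS = {"User-Agent": "multilingual-wikipedia-example/0.1"}


def language_versions(title, source_lang="en"):
    """Map language codes to the locally linked article titles."""
    resp = requests.get(
        f"https://{source_lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": title,
            "lllimit": "max",
            "format": "json",
        },
        headers=HEADERS,
        timeout=30,
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    versions = {source_lang: title}
    for link in page.get("langlinks", []):
        versions[link["lang"]] = link["*"]
    return versions


def article_length(lang, title):
    """Page length in bytes of wikitext, via prop=info."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "info", "titles": title, "format": "json"},
        headers=HEADERS,
        timeout=30,
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    return page.get("length", 0)


versions = language_versions("Wikipedia")
print(f"{len(versions)} language editions cover this article")
for lang, local_title in sorted(versions.items())[:10]:
    print(lang, local_title, article_length(lang, local_title), "bytes")
```
Even this crude byte count tends to vary widely across editions, which is one concrete reason the paper cautions against treating all language editions as equivalent samples of the same content.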
Related papers
- Towards Better Monolingual Japanese Retrievers with Multi-Vector Models [0.0]
In Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embedders.
We introduce JaColBERT, a family of multi-vector retrievers trained on two orders of magnitude less data than their multilingual counterparts.
arXiv Detail & Related papers (2023-12-26T18:07:05Z)
- The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework designed for low-resource languages.
Because low-resource languages lack enough video-text paired data to train a model, developing lip reading models for them is considered challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Lost in Translation: Large Language Models in Non-English Content Analysis [0.0]
Large language models have become the dominant approach for building AI systems to analyze and generate language online.
Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English.
arXiv Detail & Related papers (2023-06-12T19:10:47Z)
- Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)
- Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
- Are pre-trained text representations useful for multilingual and multi-dimensional language proficiency modeling? [6.294759639481189]
This paper describes our experiments and observations about the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification.
Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, none of the features achieve consistently best performance for all dimensions of language proficiency.
arXiv Detail & Related papers (2021-02-25T16:23:52Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in a resource-rich language, e.g., English, to other, less-resourced languages; a minimal sketch of this recipe appears after this list.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, such differences indicate a language-related bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
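For the translation-based data transfer idea mentioned above (Multilingual Answer Sentence Reranking via Automatically Translated Data), here is a minimal sketch of the general recipe rather than the paper's actual pipeline: machine-translate English question/candidate pairs into a target language while keeping their labels. The Hugging Face model name is an assumption; any English-to-target translation model with the same interface would work.
```python
# A minimal sketch of translation-based data transfer for answer sentence
# selection, not the paper's actual pipeline. The model name is an
# assumption; any en->target translation model would do.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# English (question, candidate sentence, relevance label) triples.
english_pairs = [
    ("Who wrote Faust?", "Faust was written by Johann Wolfgang von Goethe.", 1),
    ("Who wrote Faust?", "Goethe was born in Frankfurt in 1749.", 0),
]

translated_pairs = []
for question, candidate, label in english_pairs:
    q_t = translator(question)[0]["translation_text"]
    c_t = translator(candidate)[0]["translation_text"]
    # The label carries over unchanged: translation is assumed to preserve
    # whether the candidate answers the question (up to translation noise).
    translated_pairs.append((q_t, c_t, label))

for triple in translated_pairs:
    print(triple)
```
The resulting triples can then be used to train a reranker in the target language, at the cost of whatever noise the translation step introduces.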