SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
- URL: http://arxiv.org/abs/2006.11572v2
- Date: Tue, 14 Jul 2020 11:17:11 GMT
- Title: SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
- Authors: Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J.
Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef
Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov,
Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew
Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett
Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor
Chodroff, Ryan Cotterell, Miikka Silfverberg, Mans Hulden
- Abstract summary: The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language, such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low-resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrated the utility of data hallucination and augmentation, ensembling, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, and Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems, with mean accuracies above 90%, while others were more challenging.
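
To make the task concrete, here is a minimal sketch of the Task 0 data format and of hallucination-style augmentation. It is an illustration, not any submitted system: the lemma/form/tags triple format follows the task's UniMorph-style data, but the `hallucinate` helper, its longest-common-substring stem proxy, and the ASCII alphabet are simplifying assumptions of this sketch (real hallucination, e.g. Anastasopoulos and Neubig (2019), aligns lemma and form character by character and samples from the target language's own alphabet).

```python
# Minimal sketch of the SIGMORPHON 2020 Task 0 setup; illustrative only,
# not a submitted system. Training instances are UniMorph-style triples:
#   lemma <TAB> inflected form <TAB> morphological feature bundle
import random
import string

def longest_common_substring(a: str, b: str) -> str:
    """Longest common substring via dynamic programming; used here as a
    crude proxy for the stem shared by lemma and inflected form."""
    best, end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, end = dp[i][j], i
    return a[end - best:end]

def hallucinate(lemma: str, form: str, tags: str, n_copies: int = 3):
    """Hallucination-style augmentation (simplified): swap the shared
    stem for a random string, keeping affixes and tags intact. The
    ASCII lowercase alphabet is an assumption of this sketch."""
    stem = longest_common_substring(lemma, form)
    if len(stem) < 3:          # too little shared material to swap safely
        return []
    synthetic = []
    for _ in range(n_copies):
        fake = "".join(random.choices(string.ascii_lowercase, k=len(stem)))
        synthetic.append((lemma.replace(stem, fake, 1),
                          form.replace(stem, fake, 1),
                          tags))
    return synthetic

# One toy training triple; the real task covers 90 languages.
for triple in hallucinate("walk", "walked", "V;PST"):
    print("\t".join(triple))
```

Running the sketch prints synthetic triples such as `qzrt	qzrted	V;PST`: the affix and tag bundle survive while the stem is nonsense, which pushes a learner toward modeling the affixation rule rather than memorizing stems.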
Related papers
- On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons (arXiv, 2024-04-03)
  We analyze the neuron-level internal behavior of multilingual decoder-based pre-trained language models (PLMs) and show that language-specific neurons are unique, with only slight overlap (under 5%) between languages. Tampering with fewer than 1% of the total neurons in each model during inference drastically changes the probability of the target language occurring in generated text.
- When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages (arXiv, 2023-11-15)
  We pre-train over 10,000 monolingual and multilingual language models for over 250 languages. In moderation, adding multilingual data improves low-resource language modeling performance, but as dataset sizes increase it begins to hurt performance for both low-resource and high-resource languages.
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants (arXiv, 2023-08-31)
  We present Belebele, a dataset spanning 122 language variants that enables the evaluation of text models in high-, medium-, and low-resource languages.
- GlobalBench: A Benchmark for Global Progress in Natural Language Processing (arXiv, 2023-05-24)
  GlobalBench aims to track progress on all NLP datasets in all languages, including estimated per-speaker utility and equity of technology across languages. It currently covers 966 datasets in 190 languages and has 1,128 system submissions spanning 62 languages.
- Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models (arXiv, 2022-05-04)
  We conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar, and conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe.
- Towards Zero-shot Language Modeling (arXiv, 2021-08-06)
  We construct a neural model that is inductively biased towards learning human languages, inferring a prior distribution over models from a sample of typologically diverse training languages and harnessing additional language-specific side information as distant supervision for held-out languages.
- MuRIL: Multilingual Representations for Indian Languages (arXiv, 2021-03-19)
  India is a multilingual society with 1369 rationalized languages and dialects spoken across the country, yet today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. We propose MuRIL, a multilingual language model built specifically for IN languages.
- Cross-lingual, Character-Level Neural Morphological Tagging (arXiv, 2017-08-30)
  We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly. Learning joint character representations among multiple related languages enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model (a toy illustration of this shared-representation setup follows this list).
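
The cross-lingual character-level tagging recipe from the last entry can be sketched in a few lines. Everything below is a hypothetical simplification, not the paper's architecture: the `CharTagger` class, the tiny shared alphabet, and tagging single words out of sentence context are all assumptions of this illustration; the load-bearing idea is that one character embedding table and encoder are shared across related languages.

```python
# Hypothetical sketch of cross-lingual character-level morphological
# tagging: one character vocabulary and encoder shared across related
# languages, so low-resource examples benefit from high-resource ones.
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    def __init__(self, n_chars: int, n_tags: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)   # shared across all languages
        self.enc = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * dim, n_tags)   # scores over morph. tags

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, word_len) indices into the shared alphabet
        h, _ = self.enc(self.emb(char_ids))
        return self.out(h.mean(dim=1))          # pool characters into a word

# Training would simply mix (word, tag) examples from all related
# languages; the shared parameters carry the transfer.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyzäöõšž")}
tagger = CharTagger(n_chars=len(vocab), n_tags=12)
word = torch.tensor([[vocab[c] for c in "talossa"]])  # Finnish: 'in the house'
print(tagger(word).shape)                             # torch.Size([1, 12])
```

The real taggers operate over full sentences and richer morphological tag bundles; this sketch collapses that to one word and a flat tag inventory to keep the transfer mechanism visible.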