The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
- URL: http://arxiv.org/abs/2206.07615v1
- Date: Wed, 15 Jun 2022 15:57:22 GMT
- Title: The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
- Authors: Khuyagbaatar Batsuren, Gábor Bella, Aryaman Arora, Viktor
Martinović, Kyle Gorman, Zdeněk Žabokrtský, Amarsanaa Ganbold,
Šárka Dohnalová, Magda Ševčíková, Kateřina Pelegrinová,
Fausto Giunchiglia, Ryan Cotterell, Ekaterina Vylomova
- Abstract summary: The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes.
The best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute.
To facilitate error analysis and support future studies of any kind, we released all system predictions, the evaluation script, and all gold standard datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems
to decompose a word into a sequence of morphemes and covered most types of
morphology: compounds, derivations, and inflections. Subtask 1, word-level
morpheme segmentation, covered 5 million words in 9 languages (Czech, English,
Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13
system submissions from 7 teams; the best system averaged a 97.29% F1 score
across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2,
sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages
(Czech, English, Mongolian), received 10 system submissions from 3 teams, and
the best systems outperformed all three state-of-the-art subword tokenization
methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis
and support future studies of any kind, we released all system predictions, the
evaluation script, and all gold standard datasets.
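The Subtask 1 numbers above are word-level F1 scores computed over predicted versus gold morphemes; the released evaluation script is the authoritative implementation. Purely as an illustration, the sketch below computes morpheme-level precision, recall, and F1 under two assumptions made here: morphemes are separated by " @@", and overlap is counted as a multiset intersection.

```python
# Minimal sketch of morpheme-level evaluation (precision / recall / F1).
# The " @@" separator and multiset-overlap counting are assumptions made for
# illustration; the shared task's released evaluation script is authoritative.
from collections import Counter

def morpheme_prf(gold: str, pred: str, sep: str = " @@"):
    """Score one predicted segmentation against its gold segmentation."""
    gold_morphs = Counter(gold.split(sep))   # e.g. "inter @@nation @@al"
    pred_morphs = Counter(pred.split(sep))
    overlap = sum((gold_morphs & pred_morphs).values())  # multiset intersection
    precision = overlap / max(sum(pred_morphs.values()), 1)
    recall = overlap / max(sum(gold_morphs.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Toy usage: average word-level F1 over a tiny sample.
    pairs = [("inter @@nation @@al", "inter @@national"),
             ("walk @@ed", "walk @@ed")]
    f1s = [morpheme_prf(gold, pred)[2] for gold, pred in pairs]
    print(f"mean F1: {sum(f1s) / len(f1s):.4f}")
```

Averaging the same per-word score over a test set yields language-level figures of the kind reported above, and replacing the prediction side with BPE, ULM, or Morfessor2 output mimics the Subtask 2 baseline comparison.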
Related papers
- SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection [76.18321723846616]
The task covers more than 30 languages from seven distinct language families.
Data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity.
Participants were asked to predict labels in three tracks: (a) multilabel emotion detection, (b) emotion intensity score detection, and (c) cross-lingual emotion detection.
arXiv Detail & Related papers (2025-03-10T12:49:31Z)
- SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification [41.94295877935867]
This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition.
Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification.
We fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task.
arXiv Detail & Related papers (2024-07-07T17:19:34Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages [54.002969723086075]
We evaluate cross-lingual open-retrieval question answering systems in 16 typologically diverse languages.
The best system leveraging iteratively mined diverse negative examples achieves 32.2 F1, outperforming our baseline by 4.5 points.
The second best system uses entity-aware contextualized representations for document retrieval, and achieves significant improvements in Tamil (20.8 F1), whereas most of the other systems yield nearly zero scores.
arXiv Detail & Related papers (2022-07-02T06:54:10Z)
- 1Cademy at Semeval-2022 Task 1: Investigating the Effectiveness of Multilingual, Multitask, and Language-Agnostic Tricks for the Reverse Dictionary Task [13.480318097164389]
We focus on the Reverse Dictionary Track of the SemEval2022 task of matching dictionary glosses to word embeddings.
Models convert the input sentences into three types of embeddings: SGNS, Char, and Electra.
Our proposed ELMo-based monolingual model achieves the best results.
arXiv Detail & Related papers (2022-06-08T06:39:04Z)
- UniMorph 4.0: Universal Morphology [104.69846084893298]
This paper presents the expansions and improvements made on several fronts over the last couple of years.
Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages.
Since the last UniMorph release, we have also augmented the database with morpheme segmentation for 16 languages.
arXiv Detail & Related papers (2022-05-07T09:19:02Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)
- The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion [28.728844366333185]
In this paper, we describe the findings of the SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion.
Participants were asked to submit systems which take raw text and a list of lemmas as input, and output all inflected forms.
We present an analysis here, so that this shared task will ground further research on the topic.
arXiv Detail & Related papers (2020-05-28T03:09:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.