Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
- URL: http://arxiv.org/abs/2503.01625v2
- Date: Tue, 04 Mar 2025 15:33:16 GMT
- Title: Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
- Authors: Arne Rubehn, Christoph Rzymski, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Katja Bocklage, David Snee, Abishek Stephen, Johann-Mattis List
- Abstract summary: We present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
- Score: 0.841650621412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved into their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy to be the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
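To make the idea of coding compositional structure concrete, the sketch below shows one possible Python representation of a single annotated numeral. The `NumeralAnnotation` dataclass, its field names, and the German gloss are illustrative assumptions, not the paper's actual annotation scheme.

```python
# Minimal sketch of a compositional numeral annotation (hypothetical
# structure, not the paper's coding scheme).
from dataclasses import dataclass

@dataclass
class NumeralAnnotation:
    value: int            # the number being expressed
    surface: str          # surface form in the language
    morphemes: list[str]  # surface morphological segmentation
    underlying: list[int] # underlying arithmetic components

# German 21 "einundzwanzig" = ein 'one' + und 'and' + zwanzig 'twenty',
# i.e. underlying structure 1 + 20 (addition with inverted order).
annotation = NumeralAnnotation(
    value=21,
    surface="einundzwanzig",
    morphemes=["ein", "und", "zwanzig"],
    underlying=[1, 20],
)

print(annotation.surface, "->", " + ".join(map(str, annotation.underlying)))
```

The point of such a coding is that basic properties, for instance whether the surface segmentation mirrors the underlying arithmetic, become directly comparable across languages.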
Related papers
- Why do language models perform worse for morphologically complex languages? [0.913127392774573]
We find new evidence for a performance gap between agglutinative and fusional languages.
We propose three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement.
Results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology.
arXiv Detail & Related papers (2024-11-21T15:06:51Z)
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
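As a rough illustration of this setup, the sketch below trains a small network directly as a binary classifier of strings for the toy language a^n b^n. The LSTM architecture, the sampling scheme, and all hyperparameters are assumptions chosen for brevity, not the paper's experimental setup.

```python
# Sketch: train a network as a binary recognizer of the language a^n b^n.
import random
import torch
import torch.nn as nn

VOCAB = {"a": 0, "b": 1}

def sample(max_n=8):
    """Return (string, label): label 1 iff the string is in a^n b^n."""
    n = random.randint(1, max_n)
    if random.random() < 0.5:
        return "a" * n + "b" * n, 1
    s = "".join(random.choice("ab") for _ in range(2 * n))
    return s, int(s == "a" * n + "b" * n)

class Recognizer(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), 16)
        self.rnn = nn.LSTM(16, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, ids):
        _, (h, _) = self.rnn(self.emb(ids))
        return self.out(h[-1]).squeeze(-1)  # one logit per string

model = Recognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(500):
    s, y = sample()
    ids = torch.tensor([[VOCAB[c] for c in s]])
    loss = loss_fn(model(ids), torch.tensor([float(y)]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```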
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- A Morphology-Based Investigation of Positional Encodings [46.667985003225496]
Morphology and word order are closely linked, with the latter incorporated into transformer-based models through positional encodings.
This prompts a fundamental inquiry: Is there a correlation between the morphological complexity of a language and the utilization of positional encoding in pre-trained language models?
In pursuit of an answer, we present the first study addressing this question, encompassing 22 languages and 5 downstream tasks.
arXiv Detail & Related papers (2024-04-06T07:10:47Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement [5.223020867766102]
We investigate how different tokenization schemes impact number agreement in Spanish plurals.
We find that morphologically-aligned tokenization performs similarly to other tokenization schemes.
Our results indicate that morphological tokenization is not strictly required for performance.
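For illustration, the toy sketch below contrasts a morphologically aligned segmentation of a Spanish plural with a frequency-style subword split. Both splits are hypothetical examples, not the output of any particular tokenizer.

```python
# Two hypothetical segmentations of a Spanish plural.
word = "gatos"  # 'cats'

morph_aligned = ["gato", "s"]  # stem + plural suffix
subword_style = ["gat", "os"]  # a plausible BPE-like split

for name, toks in [("morph-aligned", morph_aligned), ("subword", subword_style)]:
    assert "".join(toks) == word  # both segmentations cover the word
    print(f"{name}: {toks}")
```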
arXiv Detail & Related papers (2024-03-20T17:01:56Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first systematic study, to our knowledge, of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Minimal Supervision for Morphological Inflection [8.532288965425805]
We bootstrap labeled data from a seed of as little as five labeled paradigms, accompanied by a large bulk of unlabeled text.
Our approach exploits different kinds of regularities in morphological systems in a two-phased setup.
We experiment with the Paradigm Cell Filling Problem over eight typologically different languages, and find that, in languages with relatively simple morphology, orthographic regularities on their own allow inflection models to achieve respectable accuracy.
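As a toy illustration of the Paradigm Cell Filling Problem, the sketch below fills missing paradigm cells by copying suffix patterns from a seed paradigm. The English paradigms and the suffix-copying rule are simplistic stand-ins for the orthographic regularities the abstract refers to.

```python
# Paradigm Cell Filling Problem, toy version: predict missing cells of an
# inflectional paradigm from a single observed seed paradigm.
seed = {"lemma": "walk", "3SG": "walks", "PAST": "walked"}
target = {"lemma": "jump", "3SG": None, "PAST": None}

for cell in ("3SG", "PAST"):
    suffix = seed[cell][len(seed["lemma"]):]  # e.g. "s", "ed"
    target[cell] = target["lemma"] + suffix

print(target)  # {'lemma': 'jump', '3SG': 'jumps', 'PAST': 'jumped'}
```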
arXiv Detail & Related papers (2021-04-17T11:07:36Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Probing for Multilingual Numerical Understanding in Transformer-Based Language Models [0.0]
We propose novel probing tasks, tested on DistilBERT, XLM, and BERT, to investigate evidence of compositional reasoning over numerical data in various natural language number systems.
By using both grammaticality judgment and value comparison classification tasks in English, Japanese, Danish, and French, we find evidence that the information encoded in these pretrained models' embeddings is sufficient for grammaticality judgments but generally not for value comparisons.
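The sketch below shows the general shape of such a probing experiment: a light classifier is trained on frozen embeddings to test whether they encode a target property. The random features are a stand-in for real DistilBERT/XLM/BERT embeddings, and the logistic-regression probe is an assumption for illustration.

```python
# Probing sketch: fit a light classifier on frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # stand-in for pretrained embeddings
y = rng.integers(0, 2, size=200)   # 1 = grammatical, 0 = ungrammatical

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~chance on random data
```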
arXiv Detail & Related papers (2020-10-13T19:56:02Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)