Trimming Phonetic Alignments Improves the Inference of Sound Correspondence Patterns from Multilingual Wordlists
- URL: http://arxiv.org/abs/2303.17932v1
- Date: Fri, 31 Mar 2023 09:55:48 GMT
- Title: Trimming Phonetic Alignments Improves the Inference of Sound Correspondence Patterns from Multilingual Wordlists
- Authors: Frederic Blum and Johann-Mattis List
- Abstract summary: Methods for the automatic inference of correspondence patterns from phonetically aligned cognate sets have been proposed.
Since annotation is tedious and time-consuming, it would be desirable to improve aligned cognate data automatically.
We propose a workflow that trims phonetic alignments in comparative linguistics prior to the inference of correspondence patterns.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound correspondence patterns form the basis of cognate detection and
phonological reconstruction in historical language comparison. Methods for the
automatic inference of correspondence patterns from phonetically aligned
cognate sets have been proposed, but their application to multilingual
wordlists requires extremely well-annotated datasets. Since annotation is
tedious and time-consuming, it would be desirable to find ways to improve
aligned cognate data automatically. Taking inspiration from trimming techniques
in evolutionary biology, which improve alignments by excluding problematic
sites, we propose a workflow that trims phonetic alignments in comparative
linguistics prior to the inference of correspondence patterns. Testing these
techniques on a large standardized collection of ten datasets with expert
annotations from different language families, we find that the best trimming
technique substantially improves the overall consistency of the alignments. The
results show a clear increase in the proportion of frequent correspondence
patterns and words exhibiting regular cognate relations.
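The core idea of the abstract, excluding problematic alignment sites before inferring correspondence patterns, in analogy to trimming in evolutionary biology, can be sketched as follows. This is a minimal illustration assuming a simple gap-ratio criterion; the threshold, the `trim_alignment` helper, and the toy cognate set are illustrative assumptions, not the paper's exact procedure.

```python
def trim_alignment(alignment, max_gap_ratio=0.5):
    """Drop alignment columns whose proportion of gaps ("-") exceeds
    max_gap_ratio, returning a new, shorter alignment.

    alignment: list of equal-length token lists (one list per word form).
    """
    n_rows = len(alignment)
    # Indices of columns that are not dominated by gaps.
    keep = [
        i for i in range(len(alignment[0]))
        if sum(row[i] == "-" for row in alignment) / n_rows <= max_gap_ratio
    ]
    return [[row[i] for i in keep] for row in alignment]


# Toy cognate set: three aligned word forms with one gap-heavy column.
aligned = [
    ["t", "o", "x", "t", "ə", "r"],
    ["d", "ɔː", "-", "t", "ə", "r"],
    ["d", "o", "-", "t", "ə", "r"],
]
# Column 2 contains gaps in 2 of 3 rows (ratio 0.67 > 0.5) and is removed;
# the remaining columns can then feed correspondence-pattern inference.
trimmed = trim_alignment(aligned, max_gap_ratio=0.5)
```

In practice, the paper evaluates several trimming strategies and measures how each affects the proportion of frequent correspondence patterns; the gap-ratio rule above stands in for whichever site-exclusion criterion is applied.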
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer [4.609569810881602]
Identification of cognates across related languages is one of the primary problems in historical linguistics.
We present a transformer-based architecture inspired by computational biology for the task of automated cognate detection.
arXiv Detail & Related papers (2024-02-05T11:47:36Z)
- Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification [60.28913031192201]
We present a novel language-driven ordering alignment method for ordinal classification.
Recent developments in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language.
Experiments on three ordinal classification tasks, including facial age estimation, historical color image (HCI) classification, and aesthetic assessment demonstrate its promising performance.
arXiv Detail & Related papers (2023-06-24T04:11:31Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- CCPrefix: Counterfactual Contrastive Prefix-Tuning for Many-Class Classification [57.62886091828512]
We propose a brand-new prefix-tuning method, Counterfactual Contrastive Prefix-tuning (CCPrefix) for many-class classification.
Basically, an instance-dependent soft prefix, derived from fact-counterfactual pairs in the label space, is leveraged to complement the language verbalizers in many-class classification.
arXiv Detail & Related papers (2022-11-11T03:45:59Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- A New Framework for Fast Automated Phonological Reconstruction Using Trimmed Alignments and Sound Correspondence Patterns [2.6212127510234797]
We present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection.
Our method yields promising results while at the same time being not only fast but also easy to apply and expand.
arXiv Detail & Related papers (2022-04-10T07:11:19Z)
- Dynamically Refined Regularization for Improving Cross-corpora Hate Speech Detection [30.462596705180534]
Hate speech classifiers exhibit substantial performance degradation when evaluated on datasets different from the source.
Previous work has attempted to mitigate this problem by regularizing specific terms from pre-defined static dictionaries.
We propose to automatically identify and reduce spurious correlations using attribution methods with dynamic refinement of the list of terms.
arXiv Detail & Related papers (2022-03-23T16:58:10Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.