An efficient automated data analytics approach to large scale
computational comparative linguistics
- URL: http://arxiv.org/abs/2001.11899v1
- Date: Fri, 31 Jan 2020 15:25:56 GMT
- Title: An efficient automated data analytics approach to large scale
computational comparative linguistics
- Authors: Gabija Mikulyte and David Gilbert
- Abstract summary: This research project aimed to overcome the challenge of analysing human language relationships.
It developed automated comparison techniques based on the phonetic representation of certain key words and concepts.
This led to a workflow implemented by combining Unix shell scripts, a purpose-built R package and SWI Prolog.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This research project aimed to overcome the challenge of analysing human
language relationships, facilitate the grouping of languages, and support the formation of
genealogical relationships between them by developing automated comparison
techniques. The techniques were based on the phonetic representation of certain key
words and concepts. Example word sets included the numbers 1-10 (curated), a large
database of numbers 1-10 and sheep-counting numbers 1-10 (other sources),
colours (curated), and basic words (curated).
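To make the word-set structure concrete, here is a minimal sketch of how such a set might be organised in Python. The layout and the words shown are illustrative assumptions, not the paper's actual format: orthographic forms stand in for the phonetic representations the study compares.

```python
# Hypothetical layout for a curated word set: concept -> language -> word form.
# Orthographic stand-ins shown; the study compares phonetic representations.
number_set = {
    "one":   {"English": "one",   "German": "eins", "Dutch": "een"},
    "two":   {"English": "two",   "German": "zwei", "Dutch": "twee"},
    "three": {"English": "three", "German": "drei", "Dutch": "drie"},
}
```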
To enable comparison within the sets, edit distance was calculated using the
Levenshtein distance metric. The Levenshtein distance between two strings is the
minimum number of single-character edits (insertions, deletions or substitutions)
needed to transform one string into the other; for example, turning "kitten" into
"sitting" takes three edits. To explore which words exhibit more or less variation,
which words are better preserved, and to examine how languages could be grouped
based on linguistic distances within sets, several data analytics techniques were
employed. These included density evaluation, hierarchical clustering, silhouette
analysis, and calculations of the mean, standard deviation and Bhattacharyya
coefficient. These techniques led to the development of a workflow which was
later implemented by combining Unix shell scripts, a purpose-built R package and
SWI Prolog. This proved to be computationally efficient and permitted the fast
exploration and analysis of large language sets.
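A minimal sketch of this comparison workflow, in Python rather than the paper's Unix shell / R / SWI Prolog stack. The word list, the average-linkage method and the two-cluster cut are illustrative assumptions; only the underlying techniques (Levenshtein distance, hierarchical clustering, silhouette scoring, Bhattacharyya coefficient) come from the abstract.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


# Toy set: the concept "two" across four languages (orthographic stand-ins
# for the phonetic representations the paper actually uses).
words = {"English": "two", "German": "zwei", "Dutch": "twee", "French": "deux"}
langs = list(words)

# Pairwise Levenshtein distance matrix between languages for this concept.
n = len(langs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = levenshtein(words[langs[i]], words[langs[j]])

# Hierarchical clustering on the condensed distance matrix, then a
# two-cluster cut scored with the silhouette measure.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(langs, labels)))
print("silhouette:", silhouette_score(D, labels, metric="precomputed"))


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions:
    1.0 when identical, 0.0 when their supports are disjoint."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.sqrt((p / p.sum()) * (q / q.sum()))))
```

The silhouette value indicates how well each language sits within its assigned cluster; values near 1 mean tight, well-separated groups, which makes it a convenient score for choosing between alternative groupings.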
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores [30.12630686473324]
We find that compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap scores.
The applicability of scores extends beyond analysis of generative models.
arXiv Detail & Related papers (2024-03-01T14:23:12Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Sentiment Classification of Code-Switched Text using Pre-trained Multilingual Embeddings and Segmentation [1.290382979353427]
We propose a multi-step natural language processing algorithm for code-switched sentiment analysis.
The proposed algorithm can be expanded for sentiment analysis of multiple languages with limited human expertise.
arXiv Detail & Related papers (2022-10-29T01:52:25Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Comparative analysis of word embeddings in assessing semantic similarity of complex sentences [8.873705500708196]
We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show that an increase in the complexity of the sentences has a significant impact on the performance of the embedding models.
arXiv Detail & Related papers (2020-10-23T19:55:11Z)
- Phonotactic Complexity and its Trade-offs [73.10961848460613]
This simple measure, bits per phoneme, allows us to compare entropy across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
arXiv Detail & Related papers (2020-05-07T21:36:59Z)