An efficient automated data analytics approach to large scale
computational comparative linguistics
- URL: http://arxiv.org/abs/2001.11899v1
- Date: Fri, 31 Jan 2020 15:25:56 GMT
- Title: An efficient automated data analytics approach to large scale
computational comparative linguistics
- Authors: Gabija Mikulyte and David Gilbert
- Abstract summary: This research project aimed to overcome the challenge of analysing human language relationships.
It developed automated comparison techniques based on the phonetic representation of certain key words and concepts.
This led to a workflow implemented by combining Unix shell scripts, a purpose-built R package and SWI Prolog.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This research project aimed to overcome the challenge of analysing human
language relationships, facilitate the grouping of languages, and support the formation of
genealogical relationships between them by developing automated comparison
techniques. The techniques were based on the phonetic representation of certain key
words and concepts. Example word sets included the numbers 1-10 (curated), a large
database of numbers 1-10 and sheep-counting numbers 1-10 (other sources),
colours (curated), and basic words (curated).
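To make the word-set structure concrete, here is a minimal sketch of how such a set might be organised in Python. The layout and the words shown are illustrative assumptions, not the paper's actual format: orthographic forms stand in for the phonetic representations the study compares.

```python
# Hypothetical layout for a curated word set: concept -> language -> word form.
# Orthographic stand-ins shown; the study compares phonetic representations.
number_set = {
    "one":   {"English": "one",   "German": "eins", "Dutch": "een"},
    "two":   {"English": "two",   "German": "zwei", "Dutch": "twee"},
    "three": {"English": "three", "German": "drei", "Dutch": "drie"},
}
```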
To enable comparison within the sets, edit distance was calculated using the
Levenshtein distance metric. The Levenshtein distance between two strings is the
minimum number of single-character edits (insertions, deletions or substitutions)
needed to transform one string into the other; for example, turning "kitten" into
"sitting" takes three edits. To explore which words exhibit more or less variation,
which words are better preserved, and to examine how languages could be grouped
based on linguistic distances within sets, several data analytics techniques were
employed. These included density evaluation, hierarchical clustering, silhouette
analysis, and calculations of the mean, standard deviation and Bhattacharyya
coefficient. These techniques led to the development of a workflow which was
later implemented by combining Unix shell scripts, a purpose-built R package and
SWI Prolog. This proved to be computationally efficient and permitted the fast
exploration and analysis of large language sets.
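A minimal sketch of this comparison workflow, in Python rather than the paper's Unix shell / R / SWI Prolog stack. The word list, the average-linkage method and the two-cluster cut are illustrative assumptions; only the underlying techniques (Levenshtein distance, hierarchical clustering, silhouette scoring, Bhattacharyya coefficient) come from the abstract.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


# Toy set: the concept "two" across four languages (orthographic stand-ins
# for the phonetic representations the paper actually uses).
words = {"English": "two", "German": "zwei", "Dutch": "twee", "French": "deux"}
langs = list(words)

# Pairwise Levenshtein distance matrix between languages for this concept.
n = len(langs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = levenshtein(words[langs[i]], words[langs[j]])

# Hierarchical clustering on the condensed distance matrix, then a
# two-cluster cut scored with the silhouette measure.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(langs, labels)))
print("silhouette:", silhouette_score(D, labels, metric="precomputed"))


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions:
    1.0 when identical, 0.0 when their supports are disjoint."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.sqrt((p / p.sum()) * (q / q.sum()))))
```

The silhouette value indicates how well each language sits within its assigned cluster; values near 1 mean tight, well-separated groups, which makes it a convenient score for choosing between alternative groupings.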
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores [30.12630686473324]
We find that compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap scores.
The applicability of scores extends beyond analysis of generative models.
arXiv Detail & Related papers (2024-03-01T14:23:12Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Lexical Complexity Prediction: An Overview [13.224233182417636]
The occurrence of unknown words in texts significantly hinders reading comprehension.
Computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives.
We present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on English data.
arXiv Detail & Related papers (2023-03-08T19:35:08Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Sentiment Classification of Code-Switched Text using Pre-trained Multilingual Embeddings and Segmentation [1.290382979353427]
We propose a multi-step natural language processing algorithm for code-switched sentiment analysis.
The proposed algorithm can be expanded for sentiment analysis of multiple languages with limited human expertise.
arXiv Detail & Related papers (2022-10-29T01:52:25Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Comparative analysis of word embeddings in assessing semantic similarity of complex sentences [8.873705500708196]
We study the sentences in existing benchmark datasets and analyze the sensitivity of various word embeddings with respect to the complexity of the sentences.
The results show that an increase in the complexity of the sentences has a significant impact on the performance of the embedding models.
arXiv Detail & Related papers (2020-10-23T19:55:11Z)
- Phonotactic Complexity and its Trade-offs [73.10961848460613]
This simple measure, bits per phoneme, allows us to compare entropy across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
arXiv Detail & Related papers (2020-05-07T21:36:59Z)