Character Entropy in Modern and Historical Texts: Comparison Metrics for
an Undeciphered Manuscript
- URL: http://arxiv.org/abs/2010.14697v2
- Date: Tue, 18 May 2021 23:33:40 GMT
- Title: Character Entropy in Modern and Historical Texts: Comparison Metrics for
an Undeciphered Manuscript
- Authors: Luke Lindemann and Claire Bowern
- Abstract summary: This paper outlines the creation of three corpora for multilingual comparison and analysis of the Voynich manuscript.
The three corpora are a corpus of Voynich texts partitioned by Currier language, scribal hand, and transcription system; a corpus of 294 language samples compiled from Wikipedia; and a corpus of eighteen transcribed historical texts in eight languages.
- Score: 0.4061135251278187
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper outlines the creation of three corpora for multilingual comparison
and analysis of the Voynich manuscript: a corpus of Voynich texts partitioned
by Currier language, scribal hand, and transcription system, a corpus of 294
language samples compiled from Wikipedia, and a corpus of eighteen transcribed
historical texts in eight languages. These corpora will be utilized in
subsequent work by the Voynich Working Group at Yale University.
We demonstrate the utility of these corpora for studying characteristics of
the Voynich script and language, with an analysis of conditional character
entropy in Voynichese. We discuss the interaction between character entropy and
language, script size and type, glyph compositionality, scribal conventions and
abbreviations, positional character variants, and bigram frequency.
This analysis characterizes the interaction between script compositionality,
character size, and predictability. We show that substantial manipulations of
glyph composition are not sufficient to align conditional entropy levels with
natural languages. The unusually predictable nature of the Voynichese script is
not attributable to a particular script or transcription system, underlying
language, or substitution cipher. Voynichese is distinct from every comparison
text in our corpora because character placement is highly constrained within
the word, and this may indicate the loss of phonemic distinctions from the
underlying language.
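The conditional character entropy discussed in the abstract can be illustrated with a minimal sketch. It assumes the common bigram estimate H(next | previous) = H(pair) - H(previous), computed from raw character counts; this listing does not spell out the paper's exact procedure, and the function names `shannon_entropy` and `conditional_char_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter of counts."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


def conditional_char_entropy(text):
    """Estimate H(next char | previous char) from adjacent character pairs.

    Uses the identity H(Y | X) = H(X, Y) - H(X), where X is the first
    character of each adjacent pair and Y is the second.
    """
    if len(text) < 2:
        raise ValueError("need at least two characters to form a pair")
    pairs = Counter(zip(text, text[1:]))   # joint counts of (previous, next)
    contexts = Counter(text[:-1])          # counts of the conditioning character
    return shannon_entropy(pairs) - shannon_entropy(contexts)


# A strictly alternating string is fully predictable from the previous character.
print(conditional_char_entropy("abababab"))
# Here every character is followed equally often by "a" or "b".
print(conditional_char_entropy("aabba"))
```

Lower values mean the next character is more predictable from the previous one, which is the sense in which the abstract calls Voynichese unusually predictable. Estimates from short samples are biased, so comparisons between texts are only meaningful at comparable sample sizes.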
Related papers
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- SenteCon: Leveraging Lexicons to Learn Human-Interpretable Language Representations [51.08119762844217]
SenteCon is a method for introducing human interpretability in deep language representations.
We show that SenteCon provides high-level interpretability at little to no cost to predictive performance on downstream tasks.
arXiv Detail & Related papers (2023-05-24T05:06:28Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Example-Based Machine Translation from Text to a Hierarchical Representation of Sign Language [1.3999481573773074]
This article presents an original method for Text-to-Sign Translation.
It compensates for data scarcity using a domain-specific parallel corpus of alignments between text and hierarchical formal descriptions of Sign Language videos in AZee.
Based on the detection of similarities present in the source text, the proposed algorithm exploits matches and substitutions of aligned segments to build multiple candidate translations.
The resulting translations are in the form of AZee expressions, designed to be used as input to avatar systems.
arXiv Detail & Related papers (2022-05-06T15:48:43Z)
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora [14.844685568451833]
We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings.
TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface.
arXiv Detail & Related papers (2021-03-19T21:26:28Z)
- A frame semantics based approach to comparative study of digitized corpus [0.0]
The paper focuses on the morphologic, syntactic, and semantic annotation process of an English-Arabic aligned corpus created from digitized novels.
The present study argues that differences in motion event conceptualization across languages can be described with frame structure and frame-to-frame relations.
arXiv Detail & Related papers (2020-05-29T22:56:25Z)
- Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus [0.0]
The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging.
The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses.
arXiv Detail & Related papers (2020-05-13T19:23:43Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)