Detecting Structured Language Alternations in Historical Documents by
Combining Language Identification with Fourier Analysis
- URL: http://arxiv.org/abs/2401.14569v1
- Date: Thu, 25 Jan 2024 23:54:34 GMT
- Authors: Hale Sirin, Sabrina Li, Tom Lippincott
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we present a generalizable workflow to identify documents in a
historic language with a nonstandard language and script combination,
Armeno-Turkish. We introduce the task of detecting distinct patterns of
multilinguality based on the frequency of structured language alternations
within a document.
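The abstract names the two ingredients (language identification and Fourier analysis) but this listing does not describe how they are combined. The following is a minimal sketch of one plausible reading, not the authors' implementation: assume a language identifier has already assigned one label per segment (line, token, or window), turn the labels for one language into a 0/1 signal, and look for the strongest non-zero frequency in its discrete Fourier transform. The function names, the labels `"hy"` and `"tr"`, and the toy sequences are all hypothetical and chosen only for illustration.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for frequencies 0..n//2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(total))
    return mags


def dominant_period(labels, target):
    """Estimate the dominant alternation period of `target`, in segments.

    Returns None when the labels show no alternation at all.
    """
    # 0/1 indicator signal: 1 where the segment was labelled `target`.
    indicator = [1.0 if lab == target else 0.0 for lab in labels]
    mean = sum(indicator) / len(indicator)
    # Remove the mean so the zero-frequency term does not dominate.
    centered = [x - mean for x in indicator]
    mags = dft_magnitudes(centered)
    if len(mags) < 2:
        return None
    # Strongest non-zero frequency bin.
    k_best = max(range(1, len(mags)), key=lambda k: mags[k])
    if mags[k_best] < 1e-9:
        return None
    return len(labels) / k_best


# Hypothetical label sequences (assumed output of a language identifier).
strict_alternation = ["hy", "tr"] * 16            # switches every segment
paired_alternation = ["hy", "hy", "tr", "tr"] * 8  # switches every two segments
monolingual = ["tr"] * 32                          # no alternation

period_strict = dominant_period(strict_alternation, "tr")
period_paired = dominant_period(paired_alternation, "tr")
period_mono = dominant_period(monolingual, "tr")
```

Under these assumptions, a document with a regular alternation (for example, entries that switch language at fixed intervals) produces a sharp spectral peak, while monolingual or irregularly mixed text does not. A real pipeline would more likely use an FFT routine and per-segment language probabilities rather than hard labels.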
Related papers
- Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement [1.4335183427838039]
We take the approach of developing curated synthetic data on a large scale, with specific properties.
We use a new multiple-choice task and datasets, Blackbird Language Matrices, to focus on a specific grammatical structural phenomenon.
We show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences.
(2024-09-10T14:58:55Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
(2024-04-09T11:39:53Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
(2022-12-21T02:41:40Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
(2022-02-19T11:55:40Z)
- Language Lexicons for Hindi-English Multilingual Text Processing [0.0]
The present Language Identification techniques presume that a document contains text in one of the fixed set of languages.
Due to the unavailability of large standard corpora for Hindi-English mixed lingual language processing tasks we propose the language lexicons.
These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary.
(2021-06-29T05:42:54Z)
- Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies [0.0]
This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora.
The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels.
Experiments performed on the English, Spanish and French editions of JRC-Acquis corpora reveal promising results on classifying and sorting documents by similar content.
(2020-12-15T10:42:40Z)
- Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios [48.57072884674938]
We propose a method to analyze language similarity using deep learning.
Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
(2020-12-01T22:44:42Z)
- Cross-Lingual Document Retrieval with Smooth Learning [31.638708227607214]
Cross-lingual document search is an information retrieval task in which the queries' language differs from the documents' language.
We propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search with different documents' languages.
(2020-11-02T03:17:39Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
(2020-05-06T04:46:11Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
(2020-04-30T16:25:39Z)