Phonetic and Visual Priors for Decipherment of Informal Romanization
- URL: http://arxiv.org/abs/2005.02517v1
- Date: Tue, 5 May 2020 21:57:27 GMT
- Title: Phonetic and Visual Priors for Decipherment of Informal Romanization
- Authors: Maria Ryskina, Matthew R. Gormley, Taylor Berg-Kirkpatrick
- Abstract summary: We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text.
We train our model directly on romanized data from two languages: Egyptian Arabic and Russian.
We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages.
- Score: 37.77170643560608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Informal romanization is an idiosyncratic process used by humans in informal
digital communication to encode non-Latin script languages into Latin character
sets found on common keyboards. Character substitution choices differ between
users but have been shown to be governed by the same main principles observed
across a variety of languages---namely, character pairs are often associated
through phonetic or visual similarity. We propose a noisy-channel WFST cascade
model for deciphering the original non-Latin script from observed romanized
text in an unsupervised fashion. We train our model directly on romanized data
from two languages: Egyptian Arabic and Russian. We demonstrate that adding
inductive bias through phonetic and visual priors on character mappings
substantially improves the model's performance on both languages, yielding
results much closer to the supervised skyline. Finally, we introduce a new
dataset of romanized Russian, collected from a Russian social network website
and partially annotated for our experiments.
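The abstract above describes a noisy-channel setup: a prior over the original-script character sequence composed with a channel model of per-character substitutions, decoded to recover the most likely source string. As a rough illustration only (not the paper's WFST cascade; the channel probabilities and bigram LM below are hand-made toy values, not learned or phonetically derived parameters), a Viterbi-style decipherment of one romanized Russian word might look like:

```python
import math

# Toy channel model: P(latin string | Cyrillic char), loosely mimicking
# phonetic similarity. All characters and probabilities here are
# illustrative assumptions, not the paper's learned parameters.
CHANNEL = {
    "п": {"p": 1.0},
    "р": {"r": 1.0},
    "и": {"i": 0.9, "e": 0.1},
    "в": {"v": 0.8, "w": 0.2},
    "е": {"e": 0.7, "ye": 0.3},
    "т": {"t": 1.0},
}

# Toy character-bigram LM over Cyrillic; "^" marks the start of a word.
BIGRAM = {
    ("^", "п"): 0.9, ("п", "р"): 0.9, ("р", "и"): 0.9,
    ("и", "в"): 0.9, ("в", "е"): 0.9, ("е", "т"): 0.9,
}

def lm(prev, cur):
    # Crude smoothing so unseen bigrams are unlikely but not impossible.
    return BIGRAM.get((prev, cur), 0.01)

def decipher(latin):
    """Viterbi search over positions in the romanized string.

    State: (chars of `latin` consumed, last Cyrillic char emitted)
    -> best (log-probability, decoded sequence) reaching that state.
    """
    best = {(0, "^"): (0.0, "")}
    for i in range(len(latin) + 1):
        for (pos, prev), (lp, seq) in list(best.items()):
            if pos != i:
                continue
            for cyr, emissions in CHANNEL.items():
                for em, p in emissions.items():
                    # A Cyrillic char may emit one or more Latin chars
                    # (cf. one-to-many romanizations like ч -> "ch").
                    if latin.startswith(em, pos):
                        score = lp + math.log(p) + math.log(lm(prev, cyr))
                        key = (pos + len(em), cyr)
                        if key not in best or score > best[key][0]:
                            best[key] = (score, seq + cyr)
    finals = [(lp, seq) for (pos, _), (lp, seq) in best.items()
              if pos == len(latin)]
    return max(finals)[1] if finals else None

print(decipher("privet"))  # → привет
```

Here the phonetic prior is mimicked by hand-assigned channel weights (e.g. и → "i" is favored over и → "e"); in the paper those preferences come from phonetic and visual similarity priors over character mappings, and the cascade is trained unsupervised on romanized text.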
Related papers
- Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus [0.0]
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags.
We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels.
arXiv Detail & Related papers (2024-10-03T16:58:21Z)
- RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization [17.46921734622369]
Romanized text reduces token fertility by 2x-4x.
Romanized text matches or outperforms native script representation across various NLU, NLG, and MT tasks.
arXiv Detail & Related papers (2024-01-25T16:11:41Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- English-to-Chinese Transliteration with Phonetic Back-transliteration [0.9281671380673306]
Transliteration is the task of translating named entities from one language to another based on phonetic similarity.
In this work, we incorporate phonetic information into neural networks in two ways: we synthesize extra data using forward and back-transliteration, carried out in a phonetic manner.
Our experiments include three language pairs and six directions, namely English to and from Chinese, Hebrew and Thai.
arXiv Detail & Related papers (2021-12-20T03:29:28Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Deciphering Undersegmented Ancient Scripts Using Phonetic Prior [31.707254394215283]
Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges.
We propose a model that handles both of these challenges by building on rich linguistic constraints.
We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian).
arXiv Detail & Related papers (2020-10-21T15:03:52Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Latin BERT: A Contextual Language Model for Classical Philology [7.513100214864645]
We present Latin BERT, a contextual language model for the Latin language.
It was trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century.
arXiv Detail & Related papers (2020-09-21T17:47:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.