Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models
- URL: http://arxiv.org/abs/2401.15068v1
- Date: Fri, 26 Jan 2024 18:49:34 GMT
- Title: Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models
- Authors: Craig Messner and Tom Lippincott
- Abstract summary: We present a novel corpus consisting of orthographically variant words found in works of 19th century U.S. literature annotated with their corresponding "standard" word pair.
We train a set of neural edit distance models to pair these variants with their standard forms, and compare the performance of these models to the performance of a set of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel corpus consisting of orthographically variant words found
in works of 19th century U.S. literature annotated with their corresponding
"standard" word pair. We train a set of neural edit distance models to pair
these variants with their standard forms, and compare the performance of these
models to the performance of a set of neural edit distance models trained on a
corpus of orthographic errors made by L2 English learners. Finally, we analyze
the relative performance of these models in the light of different negative
training sample generation strategies, and offer concluding remarks on the
unique challenge literary orthographic variation poses to string pairing
methodologies.
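As a concrete illustration of the approach described above, the following is a minimal sketch (in PyTorch) of one way a trainable edit distance model can be set up: character embeddings parameterize insertion, deletion, and substitution costs; a Wagner-Fischer dynamic program scores a (variant, standard) pair; and a margin loss trains true pairs to score lower than sampled negatives. Everything here, from the architecture to the toy data to the naive random negative-sampling strategy, is an assumption for illustration and not the authors' implementation; the paper itself compares different negative sample generation strategies.

```python
# Illustrative sketch of a trainable edit distance model; all names and
# hyperparameters are assumptions, not taken from the paper.
import random
import string

import torch
import torch.nn as nn
import torch.nn.functional as F

CHARS = string.ascii_lowercase + "'-"


def encode(word: str) -> torch.Tensor:
    """Map a word to a tensor of character indices (unknown chars dropped)."""
    return torch.tensor([CHARS.index(c) for c in word.lower() if c in CHARS])


class NeuralEditDistance(nn.Module):
    def __init__(self, n_chars: int = len(CHARS), dim: int = 16):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)
        self.sub = nn.Bilinear(dim, dim, 1)  # learned substitution cost
        self.indel = nn.Linear(dim, 1)       # learned insert/delete cost

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        ea, eb = self.emb(a), self.emb(b)
        m, n = len(a), len(b)
        # Softplus keeps all edit costs positive.
        del_cost = F.softplus(self.indel(ea)).squeeze(-1)   # (m,)
        ins_cost = F.softplus(self.indel(eb)).squeeze(-1)   # (n,)
        sub_cost = F.softplus(self.sub(
            ea.unsqueeze(1).expand(m, n, -1).reshape(m * n, -1),
            eb.unsqueeze(0).expand(m, n, -1).reshape(m * n, -1),
        )).view(m, n)                                       # (m, n)
        # Standard Wagner-Fischer recurrence over the learned costs.
        D = [[torch.zeros(()) for _ in range(n + 1)] for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = D[i - 1][0] + del_cost[i - 1]
        for j in range(1, n + 1):
            D[0][j] = D[0][j - 1] + ins_cost[j - 1]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = torch.min(torch.stack([
                    D[i - 1][j] + del_cost[i - 1],             # delete a[i-1]
                    D[i][j - 1] + ins_cost[j - 1],             # insert b[j-1]
                    D[i - 1][j - 1] + sub_cost[i - 1, j - 1],  # substitute
                ]))
        return D[m][n]


def train_step(model, opt, variant, standard, vocab, margin=1.0):
    """One update: the true pair should score lower than a sampled negative.

    Uniform random sampling is only the naive baseline strategy; the paper
    compares several negative sample generation strategies.
    """
    negative = random.choice([w for w in vocab if w != standard])
    pos = model(encode(variant), encode(standard))
    neg = model(encode(variant), encode(negative))
    loss = torch.clamp(margin + pos - neg, min=0.0)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    # Toy pairs in the spirit of 19th-century literary eye dialect.
    pairs = [("wuz", "was"), ("gwine", "going"), ("sartin", "certain")]
    vocab = [s for _, s in pairs]
    model = NeuralEditDistance()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(20):
        for variant, standard in pairs:
            train_step(model, opt, variant, standard, vocab)
```

At pairing time, a variant would be matched to the standard word in the vocabulary with the lowest predicted edit cost; the hard min in the recurrence still admits subgradients, so the whole scoring function trains end to end.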
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z) - Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus [0.0]
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags.
We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels.
arXiv Detail & Related papers (2024-10-03T16:58:21Z) - Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z) - The Scenario Refiner: Grounding subjects in images at the morphological level [2.401993998791928]
We ask whether Vision and Language (V&L) models capture such distinctions at the morphological level.
We compare the results from V&L models to human judgements and find that models' predictions differ from those of human participants.
arXiv Detail & Related papers (2023-09-20T12:23:06Z) - Visual Comparison of Language Model Adaptation [55.92129223662381]
Adapters are lightweight alternatives for model adaptation.
In this paper, we discuss several design choices and alternatives for interactive, comparative visual explanation methods.
We show that, for instance, an adapter trained on the language debiasing task according to context-0 embeddings introduces a new type of bias.
arXiv Detail & Related papers (2022-08-17T09:25:28Z) - SLUA: A Super Lightweight Unsupervised Word Alignment Model via Cross-Lingual Contrastive Learning [79.91678610678885]
We propose a super lightweight unsupervised word alignment model (SLUA).
Experimental results on several public benchmarks demonstrate that our model achieves competitive, if not better, performance.
Notably, our model is a pioneering attempt to unify bilingual word embeddings and word alignment.
arXiv Detail & Related papers (2021-02-08T05:54:11Z) - Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z) - SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Neural Baselines for Word Alignment [0.0]
We study and evaluate neural models for unsupervised word alignment for four language pairs.
We show that neural versions of the IBM-1 and hidden Markov models vastly outperform their discrete counterparts.
arXiv Detail & Related papers (2020-09-28T07:51:03Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Overestimation of Syntactic Representation in Neural Language Models [16.765097098482286]
One popular method for determining a model's ability to induce syntactic structure trains a model on strings generated according to a template, then tests the model's ability to distinguish such strings from superficially similar ones with different syntax.
We illustrate a fundamental problem with this approach by reproducing positive results from a recent paper with two non-syntactic baseline language models.
arXiv Detail & Related papers (2020-04-10T15:13:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.