Logographic Information Aids Learning Better Representations for Natural
Language Inference
- URL: http://arxiv.org/abs/2211.02136v1
- Date: Thu, 3 Nov 2022 20:40:14 GMT
- Title: Logographic Information Aids Learning Better Representations for Natural
Language Inference
- Authors: Zijian Jin, Duygu Ataman
- Abstract summary: We present a novel study which explores the benefits of providing language models with logographic information in learning better semantic representations.
Our evaluation results in six languages suggest significant benefits of using multi-modal embeddings in languages with logographic systems.
- Score: 3.677231059555795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Statistical language models conventionally implement representation
learning based on the contextual distribution of words or other formal units,
whereas any information related to the logographic features of written text is
often ignored, under the assumption that it can be recovered from co-occurrence
statistics. On the other hand, as language models become larger and require
more data to learn reliable representations, such assumptions may start to
break down, especially under conditions of data sparsity. Many languages,
including Chinese and Vietnamese, use logographic writing systems where surface
forms are represented as a visual organization of smaller graphemic units,
which often contain many semantic cues. In this paper, we present a novel study
which explores the benefits of providing language models with logographic
information in learning better semantic representations. We test our hypothesis
in the natural language inference (NLI) task by evaluating the benefit of
computing multi-modal representations that combine contextual information with
glyph information. Our evaluation results in six languages with different
typology and writing systems suggest significant benefits of using multi-modal
embeddings in languages with logographic systems, especially for words with
sparse occurrence statistics.
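For illustration only, the sketch below shows one way the kind of multi-modal fusion described in the abstract could be realized: contextual token embeddings from any pre-trained language model are concatenated with glyph embeddings produced by a small CNN over rendered character images, and the fused sequence feeds a simple NLI classifier. All module names, dimensions, and the concatenation and mean-pooling choices are assumptions made for this example, not the authors' actual architecture.

```python
# Hypothetical sketch (not the paper's released code): fusing contextual
# embeddings with glyph embeddings for an NLI classifier.
import torch
import torch.nn as nn

class GlyphEncoder(nn.Module):
    """Encodes rendered character glyphs (32x32 grayscale bitmaps) with a small CNN."""
    def __init__(self, glyph_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 16x16 -> 8x8
        )
        self.proj = nn.Linear(32 * 8 * 8, glyph_dim)

    def forward(self, glyphs: torch.Tensor) -> torch.Tensor:
        # glyphs: (batch, seq_len, 1, 32, 32) rendered character images
        b, s = glyphs.shape[:2]
        feats = self.conv(glyphs.flatten(0, 1))               # (b*s, 32, 8, 8)
        return self.proj(feats.flatten(1)).view(b, s, -1)     # (b, s, glyph_dim)

class MultiModalNLIModel(nn.Module):
    """Concatenates contextual and glyph embeddings per token, mean-pools
    over the premise-hypothesis sequence, and predicts an NLI label."""
    def __init__(self, context_dim: int = 768, glyph_dim: int = 128, num_labels: int = 3):
        super().__init__()
        self.glyph_encoder = GlyphEncoder(glyph_dim)
        self.classifier = nn.Sequential(
            nn.Linear(context_dim + glyph_dim, 256), nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, contextual_embs: torch.Tensor, glyphs: torch.Tensor) -> torch.Tensor:
        # contextual_embs: (batch, seq_len, context_dim), e.g. hidden states of a pre-trained LM
        glyph_embs = self.glyph_encoder(glyphs)                    # (batch, seq_len, glyph_dim)
        fused = torch.cat([contextual_embs, glyph_embs], dim=-1)   # per-token fusion
        pooled = fused.mean(dim=1)                                 # simple mean pooling
        return self.classifier(pooled)                             # (batch, num_labels) logits

# Toy usage with random tensors standing in for real LM outputs and rendered glyphs.
model = MultiModalNLIModel()
contextual = torch.randn(2, 20, 768)        # e.g. multilingual BERT states for 20 tokens
glyph_images = torch.randn(2, 20, 1, 32, 32)
logits = model(contextual, glyph_images)    # shape: (2, 3)
```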
Related papers
- Analyzing The Language of Visual Tokens [48.62180485759458]
We take a natural-language-centric approach to analyzing discrete visual languages.
We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts.
We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
arXiv Detail & Related papers (2024-11-07T18:59:28Z) - LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP [30.804518354947565]
A large portion of logographic data persists in a purely visual form due to the absence of transcription.
This issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages.
We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages.
arXiv Detail & Related papers (2024-08-08T17:58:06Z) - Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z) - Probing Linguistic Information For Logical Inference In Pre-trained
Language Models [2.4366811507669124]
We propose a methodology for probing linguistic information for logical inference in pre-trained language model representations.
We find that pre-trained language models do encode several types of linguistic information relevant to inference, although some types of information are only weakly encoded.
We demonstrate language models' potential as semantic and background knowledge bases for supporting symbolic inference methods.
arXiv Detail & Related papers (2021-12-03T07:19:42Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Syntax Representation in Word Embeddings and Neural Networks -- A Survey [4.391102490444539]
This paper covers approaches of evaluating the amount of syntactic information included in the representations of words.
We mainly summarize research on English monolingual data for language modeling tasks.
We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.
arXiv Detail & Related papers (2020-10-02T15:44:58Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of
World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.