Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
- URL: http://arxiv.org/abs/2203.07911v1
- Date: Tue, 15 Mar 2022 13:48:38 GMT
- Title: Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models
- Authors: Mark Chu, Bhargav Srinivasa Desikan, Ethan O. Nadler, Ruggerio L. Sardo, Elise Darragh-Ford, and Douglas Guilbeault
- Abstract summary: We show that $n$-grams composed of random character sequences, or $garble$, provide a novel context for studying word meaning within and beyond extant language.
By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams.
- Score: 0.7454831343436739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing models learn word representations based on the
distributional hypothesis, which asserts that word context (e.g.,
co-occurrence) correlates with meaning. We propose that $n$-grams composed of
random character sequences, or $garble$, provide a novel context for studying
word meaning both within and beyond extant language. In particular, randomly
generated character $n$-grams lack meaning but contain primitive information
based on the distribution of characters they contain. By studying the
embeddings of a large corpus of garble, extant language, and pseudowords using
CharacterBERT, we identify an axis in the model's high-dimensional embedding
space that separates these classes of $n$-grams. Furthermore, we show that this
axis relates to structure within extant language, including word
part-of-speech, morphology, and concept concreteness. Thus, in contrast to
studies that are mainly limited to extant language, our work reveals that
meaning and primitive information are intrinsically linked.
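The core analysis above can be illustrated with a minimal, self-contained sketch. The paper embeds garble and extant words with CharacterBERT and finds a separating axis in the embedding space; the toy below substitutes a 26-dimensional character-frequency featurization for the real embeddings (an assumption for illustration only, not the paper's method) and takes the axis as the difference of class means between a handful of English words and uniformly random character $n$-grams.

```python
import random
import string
from collections import Counter

def make_garble(n, rng):
    """Draw a random character n-gram ("garble"): uniform over lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))

def char_features(word):
    """26-dimensional character-frequency vector -- a crude, illustrative
    stand-in for a character-aware embedding such as CharacterBERT's."""
    counts = Counter(word)
    return [counts.get(c, 0) / len(word) for c in string.ascii_lowercase]

def mean_diff_axis(group_a, group_b):
    """Difference of class means: the simplest axis separating two clusters."""
    def centroid(vectors):
        return [sum(v[i] for v in vectors) / len(vectors)
                for i in range(len(vectors[0]))]
    return [a - b for a, b in zip(centroid(group_a), centroid(group_b))]

rng = random.Random(0)
english = ["language", "meaning", "character", "random", "sequence", "model"]
garble = [make_garble(8, rng) for _ in range(200)]

axis = mean_diff_axis([char_features(w) for w in english],
                      [char_features(w) for w in garble])

def project(word):
    """Scalar position of a word along the separating axis."""
    return sum(x * a for x, a in zip(char_features(word), axis))
```

Because English spelling is far from a uniform draw over letters, the two classes already separate along this axis; the paper's finding is the much stronger claim that a learned CharacterBERT space contains such an axis and that it aligns with part-of-speech, morphology, and concreteness.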
Related papers
- Linguistic Structure from a Bottleneck on Sequential Information Processing [5.850665541267672]
We show that natural-language-like systematicity arises in codes that are constrained by predictive information.
We show that human languages are structured to have low predictive information at the levels of phonology, morphology, syntax, and semantics.
arXiv Detail & Related papers (2024-05-20T15:25:18Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Semantic Role Labeling Meets Definition Modeling: Using Natural Language to Describe Predicate-Argument Structures [104.32063681736349]
We present an approach to describe predicate-argument structures using natural language definitions instead of discrete labels.
Our experiments and analyses on PropBank-style and FrameNet-style, dependency-based and span-based SRL also demonstrate that a flexible model with an interpretable output does not necessarily come at the expense of performance.
arXiv Detail & Related papers (2022-12-02T11:19:16Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models [12.0190584907439]
We propose a new method to exploit word structure and integrate lexical semantics into character representations of pre-trained models.
We show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks.
arXiv Detail & Related papers (2022-07-13T02:28:08Z)
- Disentangled Action Recognition with Knowledge Bases [77.77482846456478]
We aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns.
Previous work utilizes verb-noun compositional action nodes in the knowledge graph, making it inefficient to scale.
We propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions.
arXiv Detail & Related papers (2022-07-04T20:19:13Z)
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- Word Order Does Matter (And Shuffled Language Models Know It) [9.990431777927421]
Recent studies have shown that language models pretrained and/or fine-tuned on randomly permuted sentences exhibit competitive performance on GLUE.
We investigate what position embeddings learned from shuffled text encode, showing that these models retain information pertaining to the original, naturalistic word order.
arXiv Detail & Related papers (2022-03-21T14:10:15Z)
- Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
arXiv Detail & Related papers (2021-06-09T22:59:12Z)
- Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention [19.520840812910357]
Sindhi word segmentation is a challenging task due to space omission and insertion issues.
Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features.
We propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task.
arXiv Detail & Related papers (2020-12-30T08:31:31Z)
- Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish [1.5356167668895644]
Semantic neologisms (SN) are words that acquire a new word meaning while maintaining their form.
To detect SN in a semi-automatic way, we developed a system that implements a combination of the following strategies.
We examine the following word embedding models: Word2Vec, Sense2Vec, and FastText.
arXiv Detail & Related papers (2020-01-12T21:54:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.