Geometric Signatures of Compositionality Across a Language Model's Lifetime
- URL: http://arxiv.org/abs/2410.01444v3
- Date: Fri, 07 Feb 2025 03:19:02 GMT
- Title: Geometric Signatures of Compositionality Across a Language Model's Lifetime
- Authors: Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng
- Abstract summary: We study whether contemporary language models reflect the intrinsic simplicity of language enabled by compositionality.
We find that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training.
Our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.
- Score: 47.25475802128033
- License:
- Abstract: By virtue of linguistic compositionality, few syntactic rules and a finite lexicon can generate an unbounded number of sentences. That is, language, though seemingly high-dimensional, can be explained using relatively few degrees of freedom. An open question is whether contemporary language models (LMs) reflect the intrinsic simplicity of language that is enabled by compositionality. We take a geometric view of this problem by relating the degree of compositionality in a dataset to the intrinsic dimension (ID) of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations' ID, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.
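The abstract relates compositionality to the intrinsic dimension (ID) of a dataset's representations. The paper's exact estimator is not specified here, so the sketch below uses the standard TwoNN estimator (Facco et al., 2017), a common choice for measuring nonlinear ID of neural representations: for each point, the ratio of its second- to first-nearest-neighbor distance follows a Pareto law whose exponent is the ID. The data setup is purely illustrative.

```python
import numpy as np

def two_nn_id(X):
    """TwoNN intrinsic-dimension estimate for points X of shape (N, D).

    For each point, mu = r2 / r1 (distances to its 2nd and 1st nearest
    neighbors). Under the TwoNN model, the maximum-likelihood estimate of
    the intrinsic dimension is d = N / sum(log mu).
    """
    # Squared pairwise distances via the Gram trick (avoids an N*N*D tensor).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)             # exclude self-distances
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])  # r1, r2 for every point
    mu = r[:, 1] / r[:, 0]
    return len(mu) / np.sum(np.log(mu))

# Sanity check: points lying on a 2-D subspace of a 50-D ambient space
# should yield an ID estimate near 2, despite the high ambient dimension.
rng = np.random.default_rng(0)
basis = rng.standard_normal((2, 50))
points = rng.standard_normal((1000, 2)) @ basis
print(f"estimated ID: {two_nn_id(points):.2f}")  # close to the true value of 2
```

The same idea underlies the paper's geometric view: a more compositional dataset should be describable with fewer degrees of freedom, hence a lower ID of its representations.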
Related papers
- Investigating Idiomaticity in Word Representations [9.208145117062339]
We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese).
We present a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels.
We define a set of fine-grained metrics of Affinity and Scaled Similarity to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
arXiv Detail & Related papers (2024-11-04T21:05:01Z) - Evaluating Morphological Compositional Generalization in Large Language Models [17.507983593566223]
We investigate the morphological generalization abilities of large language models (LLMs) through the lens of compositionality.
We focus on agglutinative languages such as Turkish and Finnish.
Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots.
While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.
arXiv Detail & Related papers (2024-10-16T15:17:20Z) - Linearity of Relation Decoding in Transformer Language Models [82.47019600662874]
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations.
We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation.
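The claim is that relation decoding is well-approximated by a single linear (affine) transformation on the subject representation. The paper's probing pipeline is not reproduced here; the sketch below is a hypothetical stand-in using synthetic subject/object vectors generated from a ground-truth affine map, fit with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 64, 200

# Hypothetical setup: subject states S and object states O related by an
# unknown affine map O = S W^T + b plus noise, standing in for an LM's
# internal representations of (subject, relation, object) triples.
W_true = rng.standard_normal((dim, dim)) / np.sqrt(dim)
b_true = rng.standard_normal(dim)
S = rng.standard_normal((n_pairs, dim))
O = S @ W_true.T + b_true + 0.01 * rng.standard_normal((n_pairs, dim))

# Fit the affine approximation via least squares, appending a constant
# feature so the bias is estimated jointly with the weight matrix.
S_aug = np.hstack([S, np.ones((n_pairs, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
W_hat, b_hat = coef[:dim].T, coef[dim]

# Held-out check: the fitted map should predict unseen object vectors.
S_test = rng.standard_normal((50, dim))
O_test = S_test @ W_true.T + b_true
pred = S_test @ W_hat.T + b_hat
rel_err = np.linalg.norm(pred - O_test) / np.linalg.norm(O_test)
print(f"held-out relative error: {rel_err:.4f}")  # small when the relation is linear
```

In the paper's setting, a low held-out error for a given relation is the evidence that the LM computes that relation (approximately) linearly; relations where the fit fails are the interesting counterexamples.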
arXiv Detail & Related papers (2023-08-17T17:59:19Z) - Recursive Neural Networks with Bottlenecks Diagnose (Non-)Compositionality [65.60002535580298]
Quantifying compositionality of data is a challenging task, which has been investigated primarily for short utterances.
We show that comparing data's representations in models with and without a bottleneck can be used to produce a compositionality metric.
The procedure is applied to the evaluation of arithmetic expressions using synthetic data, and sentiment classification using natural language data.
arXiv Detail & Related papers (2023-01-31T15:46:39Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Are Representations Built from the Ground Up? An Empirical Examination of Local Composition in Language Models [91.3755431537592]
Representing compositional and non-compositional phrases is critical for language understanding.
We first formulate a problem of predicting the LM-internal representations of longer phrases given those of their constituents.
While we would expect the predictive accuracy to correlate with human judgments of semantic compositionality, we find this is largely not the case.
arXiv Detail & Related papers (2022-10-07T14:21:30Z) - Morphology Without Borders: Clause-Level Morphological Annotation [8.559428282730021]
We propose to view morphology as a clause-level phenomenon, rather than word-level.
We deliver a novel dataset for clause-level morphology covering 4 typologically-different languages: English, German, Turkish and Hebrew.
Our experiments show that the clause-level tasks are substantially harder than the respective word-level tasks, while having comparable complexity across languages.
arXiv Detail & Related papers (2022-02-25T17:20:28Z) - The Combinatorics of "Salva Veritate" Principles [0.0]
Concepts of grammatical compositionality arise in many theories of both natural and artificial languages.
We propose that many instances of compositionality should entail non-trivial claims about the expressive power of languages.
arXiv Detail & Related papers (2022-01-13T19:00:56Z) - A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
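This entry's alignment measure is top-1 bitext retrieval: given parallel sentences embedded in a shared space, how often is a sentence's true translation its nearest neighbor? The sketch below illustrates the metric on toy data; the random rotation standing in for "another language's embeddings" is an assumption for demonstration, not the paper's data.

```python
import numpy as np

def retrieval_accuracy(src, tgt):
    """Top-1 bitext retrieval accuracy over row-aligned parallel embeddings.

    For each source embedding, checks whether its true translation (the
    same-index row of tgt) is the most cosine-similar target embedding.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine-similarity matrix, shape (N, N)
    return float(np.mean(sim.argmax(axis=1) == np.arange(len(src))))

# Toy stand-in for 100 parallel sentences: the "other language's" space is
# a noisy orthogonal rotation of the source space, so alignment is high.
rng = np.random.default_rng(0)
src = rng.standard_normal((100, 32))
rot, _ = np.linalg.qr(rng.standard_normal((32, 32)))
tgt = src @ rot + 0.1 * rng.standard_normal((100, 32))

aligned = retrieval_accuracy(src @ rot, tgt)      # well-aligned: near 1.0
baseline = retrieval_accuracy(rng.standard_normal((100, 32)), tgt)  # chance: near 1/N
print(aligned, baseline)
```

The paper then regresses this kind of retrieval score on linguistic and training-related features to ask which factors predict cross-lingual alignment.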
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Compositionality and Generalization in Emergent Languages [42.68870559695238]
We study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations.
We find no correlation between the degree of compositionality of an emergent language and its ability to generalize.
However, the more compositional a language is, the more easily it is picked up by new learners.
arXiv Detail & Related papers (2020-04-20T08:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.