Characterizing Intrinsic Compositionality in Transformers with Tree
Projections
- URL: http://arxiv.org/abs/2211.01288v2
- Date: Thu, 3 Nov 2022 22:44:21 GMT
- Title: Characterizing Intrinsic Compositionality in Transformers with Tree
Projections
- Authors: Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning
- Abstract summary: Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
- Score: 72.45375959893218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When trained on language data, do transformers learn some arbitrary
computation that utilizes the full capacity of the architecture or do they
learn a simpler, tree-like computation, hypothesized to underlie compositional
meaning systems like human languages? There is an apparent tension between
compositional accounts of human language understanding, which are based on a
restricted bottom-up computational process, and the enormous success of neural
models like transformers, which can route information arbitrarily between
different parts of their input. One possibility is that these models, while
extremely flexible in principle, in practice learn to interpret language
hierarchically, ultimately building sentence representations close to those
predictable by a bottom-up, tree-structured model. To evaluate this
possibility, we describe an unsupervised and parameter-free method to
\emph{functionally project} the behavior of any transformer into the space of
tree-structured networks. Given an input sentence, we produce a binary tree
that approximates the transformer's representation-building process and a score
that captures how "tree-like" the transformer's behavior is on the input. While
calculation of this score does not require training any additional models, it
provably upper-bounds the fit between a transformer and any tree-structured
approximation. Using this method, we show that transformers for three different
tasks become more tree-like over the course of training, in some cases
unsupervisedly recovering the same trees as supervised parsers. These trees, in
turn, are predictive of model behavior, with more tree-like models generalizing
better on tests of compositional generalization.
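To make the projection step above concrete, below is a minimal sketch of the kind of search it involves, assuming one has already computed, for every span (i, j) of a sentence, a distance d[(i, j)] between the transformer's contextual representation of the span and a context-free representation obtained by encoding the span in isolation. A CKY-style dynamic program then returns the binary tree whose spans accumulate the least total distance, and the optimal cost can be read as a rough (inverse) tree-likeness score. The function name tree_projection, the dictionary d, and the toy distances are illustrative assumptions, not the paper's released implementation.

```python
from functools import lru_cache

def tree_projection(d, n):
    """CKY-style search for the binary bracketing of tokens 0..n-1 that
    minimizes the summed span distances d[(i, j)] over the spans it contains.

    d: dict mapping (i, j), with i <= j inclusive token indices, to a
       non-negative "context distortion" score for that span (missing
       spans default to 0).
    Returns (best_cost, tree), where tree is a nested tuple of token indices.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == j:                        # single token: a leaf, no bracketing choice
            return d.get((i, j), 0.0), i
        candidates = []
        for k in range(i, j):             # split point: left = i..k, right = k+1..j
            left_cost, left_tree = best(i, k)
            right_cost, right_tree = best(k + 1, j)
            cost = d.get((i, j), 0.0) + left_cost + right_cost
            candidates.append((cost, (left_tree, right_tree)))
        return min(candidates, key=lambda c: c[0])

    return best(0, n - 1)

# Toy usage: 4 tokens with hand-made span distances; a lower total cost
# indicates a bracketing whose spans behave more "context-freely".
if __name__ == "__main__":
    d = {(0, 1): 0.1, (2, 3): 0.1, (0, 2): 0.8, (1, 3): 0.8, (1, 2): 0.8, (0, 3): 0.3}
    cost, tree = tree_projection(d, 4)
    print(cost, tree)  # e.g. 0.5 ((0, 1), (2, 3)), up to float rounding
```

In the paper's terms, the per-span distances play the role of a span-level contextual-invariance measure; the sketch only illustrates that, given such distances, recovering the best-fitting binary tree is a standard dynamic program rather than a learned model.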
Related papers
- Tree Transformers are an Ineffective Model of Syntactic Constituency [0.0]
Linguists have long held that a key aspect of natural language syntax is the organization of language units into constituent structures.
A number of alternative models have been proposed to provide inductive biases towards constituency, including the Tree Transformer.
We investigate Tree Transformers to study whether they utilize meaningful and/or useful constituent structures.
arXiv Detail & Related papers (2024-11-25T23:53:46Z)
- TreeCoders: Trees of Transformers [0.0]
We introduce TreeCoders, a novel family of transformer trees.
Transformers serve as nodes, and generic classifiers learn to select the best child.
TreeCoders naturally lends itself to distributed implementation.
arXiv Detail & Related papers (2024-11-11T18:40:04Z)
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- Grokking of Hierarchical Structure in Vanilla Transformers [72.45375959893218]
We show that transformer language models can learn to generalize hierarchically after training for extremely long periods.
Intermediate-depth models generalize better than both very deep and very shallow transformers.
arXiv Detail & Related papers (2023-05-30T04:34:13Z)
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first method, TP-Transformer, augments the traditional Transformer architecture with an additional component that represents structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
- Trees in transformers: a theoretical analysis of the Transformer's ability to represent trees [6.576972696596151]
We first analyze the theoretical capability of the standard Transformer architecture to learn tree structures, showing that a Transformer can learn tree structures well in theory.
We conduct experiments with synthetic data and find that the standard Transformer achieves similar accuracy compared to a Transformer where tree position information is explicitly encoded.
arXiv Detail & Related papers (2021-12-16T00:02:02Z)