Related papers: Sneaking Syntax into Transformer Language Models with Tree Regularization


URL: http://arxiv.org/abs/2411.18885v1
Date: Thu, 28 Nov 2024 03:27:48 GMT
Title: Sneaking Syntax into Transformer Language Models with Tree Regularization
Authors: Ananjan Nandi, Christopher D. Manning, Shikhar Murty
Abstract summary: Introducing syntactic inductive biases could unlock more robust and data-efficient learning in transformer language models. Existing methods for incorporating such structure greatly restrict models. TreeReg mitigates degradation of performance on adversarial NLI benchmarks by 41.2 points.
Score: 33.74552367356904
License:
Abstract: While compositional accounts of human language understanding are based on a hierarchical tree-like process, neural models like transformers lack a direct inductive bias for such tree structures. Introducing syntactic inductive biases could unlock more robust and data-efficient learning in transformer language models (LMs), but existing methods for incorporating such structure greatly restrict models, either limiting their expressivity or increasing inference complexity. This work instead aims to softly inject syntactic inductive biases into given transformer circuits, through a structured regularizer. We introduce TreeReg, an auxiliary loss function that converts bracketing decisions from silver parses into a set of differentiable orthogonality constraints on vector hidden states. TreeReg integrates seamlessly with the standard LM objective, requiring no architectural changes. LMs pre-trained with TreeReg on natural language corpora such as WikiText-103 achieve up to 10% lower perplexities on out-of-distribution data and up to 9.5 point improvements in syntactic generalization, requiring less than half the training data to outperform standard LMs. TreeReg still provides gains for pre-trained LLMs: continued pre-training of Sheared Llama with TreeReg results in improved syntactic generalization, and fine-tuning on MultiNLI with TreeReg mitigates degradation of performance on adversarial NLI benchmarks by 41.2 points.
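
As a rough illustration of the idea in the abstract (bracketing decisions turned into orthogonality constraints on hidden states, added to the LM objective as an auxiliary loss), the sketch below shows one way such a penalty could be computed. It is not the paper's formulation: the mean pooling, the comparison of each span with the tokens outside it, the squared-cosine penalty, the `weight` value, and all function names are assumptions made for illustration. It uses plain Python lists for readability; an actual training setup would need an automatic-differentiation framework so the penalty can be backpropagated.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonality_penalty(hidden, spans):
    """Hypothetical penalty: mean squared cosine between each span and its context.

    hidden: list of T hidden-state vectors, one per token.
    spans:  list of (start, end) half-open token ranges taken from a parse.
    The penalty is 0 when every span's pooled vector is orthogonal to the
    pooled vector of the tokens outside that span.
    """
    total, count = 0.0, 0
    for start, end in spans:
        inside = hidden[start:end]
        outside = hidden[:start] + hidden[end:]
        if not inside or not outside:
            continue  # a span covering the whole sentence has no context
        total += cosine(mean_vector(inside), mean_vector(outside)) ** 2
        count += 1
    return total / count if count else 0.0


def total_loss(lm_loss, hidden, spans, weight=0.1):
    """Add the auxiliary penalty to a standard LM loss (weight is assumed)."""
    return lm_loss + weight * orthogonality_penalty(hidden, spans)
```

For example, with `hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]` and `spans = [(0, 2)]`, the span and its context are already orthogonal, so the penalty is zero and `total_loss` returns the LM loss unchanged. Because the penalty only adds a term to the loss, the model architecture itself is untouched, which matches the abstract's claim that no architectural changes are required.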

Related papers

Soft regression trees: a model variant and a decomposition training algorithm [0.24578723416255752]
We propose a new variant of soft multivariate regression trees (SRTs) where, for every input vector, the prediction is defined as a linear regression associated to a single leaf node. SRTs exhibit the conditional computational property, i.e., each prediction depends on a small number of nodes. Experiments on 15 well-known datasets indicate that our SRTs and decomposition algorithm yield higher accuracy and robustness compared with traditional soft regression trees.
arXiv Detail & Related papers (2025-01-10T13:06:36Z)
Tree-Planted Transformers: Unidirectional Transformer Language Models with Implicit Syntactic Supervision [4.665860995185884]
We propose a new method dubbed tree-planting. Instead of explicitly generating syntactic structures, we "plant" trees into attention weights of unidirectional Transformer LMs. Tree-Planted Transformers inherit the training efficiency of syntactic language models (SLMs) without changing the inference efficiency of their underlying Transformer LMs.
arXiv Detail & Related papers (2024-02-20T03:37:24Z)
Language Models as Hierarchy Encoders [22.03504018330068]
We introduce a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs). Our method situates the output embedding space of pre-trained LMs within a Poincaré ball with a curvature that adapts to the embedding dimension. We evaluate HiTs against pre-trained LMs, standard fine-tuned LMs, and several hyperbolic embedding baselines.
arXiv Detail & Related papers (2024-01-21T02:29:12Z)
Differentiable Tree Operations Promote Compositional Generalization [106.59434079287661]
The Differentiable Tree Machine (DTM) architecture integrates an interpreter with an external memory and an agent that learns to sequentially select tree operations. DTM achieves 100% while existing baselines such as Transformer, Tree Transformer, LSTM, and Tree2Tree LSTM achieve less than 30%.
arXiv Detail & Related papers (2023-06-01T14:46:34Z)
Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input. We show that transformers for three different tasks become more tree-like over the course of training. These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions. Existing neural models have been shown to lack this basic ability in learning symbolic structures. We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex. This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
Recursive Top-Down Production for Sentence Generation with Latent Trees [77.56794870399288]
We model the production property of context-free grammars for natural and synthetic languages. We present a dynamic programming algorithm that marginalises over latent binary tree structures with $N$ leaves. We also present experimental results on German-English translation on the Multi30k dataset.
arXiv Detail & Related papers (2020-10-09T17:47:16Z)
Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
