Physics of Language Models: Part 1, Learning Hierarchical Language Structures
- URL: http://arxiv.org/abs/2305.13673v3
- Date: Sun, 2 Jun 2024 23:16:50 GMT
- Title: Physics of Language Models: Part 1, Learning Hierarchical Language Structures
- Authors: Zeyuan Allen-Zhu, Yuanzhi Li
- Abstract summary: Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs with hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
- Score: 51.68385617116854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge. Previous research has primarily explored how these models handle simple tasks like name copying or selection, and we extend this by investigating how these models grasp complex, recursive language structures defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs with hierarchical rules, capable of generating lengthy sentences (e.g., hundreds of tokens) that are locally ambiguous and require dynamic programming to parse. Despite this complexity, we demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it. We explore the model's internals, revealing that its hidden states precisely capture the structure of CFGs and that its attention patterns resemble the information passing in a dynamic programming algorithm. This paper also presents several corollaries, including showing why absolute positional embedding is inferior to relative attention or rotary embedding; demonstrating that encoder-based models (e.g., BERT, DeBERTa) cannot learn very deeply nested CFGs as effectively as generative models (e.g., GPT); and highlighting the necessity of adding structural and syntactic errors to the pretraining data to make the model more robust to corrupted language prefixes.
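To make the setup concrete, here is a minimal Python sketch of the two ingredients the abstract describes: sampling long strings from a small hierarchical CFG in Chomsky normal form, and the cubic-time dynamic program (CYK) needed to recognize such strings. The grammar and symbol names are illustrative stand-ins, not one of the paper's actual synthetic CFG instances.

```python
import random

# A toy CFG in Chomsky normal form. Nonterminals map to rules that are
# either binary (A -> B C) or terminal (A -> 'a'). Illustrative only.
RULES = {
    "S": [("A", "B"), ("B", "A")],
    "A": [("A", "B"), ("a",)],
    "B": [("B", "A"), ("b",)],
}

def sample(symbol="S", depth=0, max_depth=8):
    """Expand `symbol` top-down, picking a uniformly random rule each step."""
    choices = RULES[symbol]
    if depth >= max_depth:
        # Force termination by preferring terminal rules once we get deep.
        choices = [r for r in choices if len(r) == 1] or choices
    rule = random.choice(choices)
    if len(rule) == 1:
        return rule[0]
    return "".join(sample(s, depth + 1, max_depth) for s in rule)

def generates(tokens):
    """CYK dynamic program: chart[i][l] holds every nonterminal that can
    derive tokens[i:i+l]; the string is grammatical iff S derives it all."""
    n = len(tokens)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][1] = {A for A, rules in RULES.items() if (tok,) in rules}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for A, rules in RULES.items():
                    for rule in rules:
                        if (len(rule) == 2
                                and rule[0] in chart[i][split]
                                and rule[1] in chart[i + split][length - split]):
                            chart[i][length].add(A)
    return "S" in chart[0][n]

sentence = sample()
print(sentence, generates(list(sentence)))
```

Local ambiguity shows up directly in the chart: a span can be derivable from more than one nonterminal, so a greedy left-to-right reading cannot settle the structure, and something like this quadratic-size chart with cubic-time filling is required.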
Related papers
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval-augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model on the linguistic task of morphological glossing.
We leverage linguistic information to compensate for the lack of data and trainable parameters, drawing on written descriptive grammars that are interpreted and distilled by an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
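As a rough illustration of the recipe this entry describes, the sketch below retrieves passages from a written descriptive grammar and asks an LLM to revise the compact model's draft gloss. The toy lexical retriever, the prompt wording, and the `call_llm` parameter are all placeholder assumptions, not the paper's actual pipeline.

```python
def retrieve(query, grammar_passages, k=3):
    """Toy lexical-overlap retriever over passages from a descriptive grammar."""
    return sorted(grammar_passages,
                  key=lambda p: len(set(query.split()) & set(p.split())),
                  reverse=True)[:k]

def correct_gloss(sentence, draft_gloss, grammar_passages, call_llm):
    """Prompt an LLM (any completion function passed as `call_llm`) to revise
    the compact model's draft gloss, grounded in retrieved grammar snippets."""
    context = "\n".join(retrieve(sentence, grammar_passages))
    prompt = (
        f"Grammar notes:\n{context}\n\n"
        f"Sentence: {sentence}\n"
        f"Draft gloss from a small model: {draft_gloss}\n"
        f"Corrected gloss:"
    )
    return call_llm(prompt)
```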
arXiv Detail & Related papers (2024-10-01T04:20:14Z)
- Hidden Holes: topological aspects of language models [1.1172147007388977]
We study the evolution of topological structure in GPT-based large language models across depth and time during training.
We show that the latter exhibit more topological complexity, with a distinct pattern of changes common to all natural languages but absent from synthetically generated data.
arXiv Detail & Related papers (2024-06-09T14:25:09Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
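For concreteness, here is a toy instantiation of a regular-language ICLL task (our own assumption of the setup, not the paper's benchmark code): sample a random partial automaton, generate strings by walking its allowed transitions, and concatenate them into a prompt that an in-context learner must continue with further strings from the same hidden regular language.

```python
import random

def random_automaton(n_states=4, alphabet="ab", seed=0):
    """Sample a random partial automaton: each state allows only a subset of
    letters, each leading to a random next state, so the set of possible
    walks forms a (hidden) regular language."""
    rng = random.Random(seed)
    delta = {}
    for q in range(n_states):
        for ch in rng.sample(alphabet, rng.randint(1, len(alphabet))):
            delta[(q, ch)] = rng.randrange(n_states)
    return delta

def sample_string(delta, length, rng):
    """Generate one string by walking the automaton's allowed transitions."""
    q, out = 0, []
    for _ in range(length):
        ch = rng.choice([c for (state, c) in delta if state == q])
        out.append(ch)
        q = delta[(q, ch)]
    return "".join(out)

def icll_prompt(delta, n_examples=8, seed=1):
    """Concatenate sampled strings; the in-context task is to continue with
    further strings from the same hidden language."""
    rng = random.Random(seed)
    return "\n".join(sample_string(delta, rng.randint(3, 8), rng)
                     for _ in range(n_examples))

print(icll_prompt(random_automaton()))
```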
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- GPT Struct Me: Probing GPT Models on Narrative Entity Extraction [2.049592435988883]
We evaluate the capabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5 -- in the extraction of narrative entities.
This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles.
arXiv Detail & Related papers (2023-11-24T16:19:04Z)
- OntoType: Ontology-Guided and Pre-Trained Language Model Assisted Fine-Grained Entity Typing [25.516304052884397]
Fine-grained entity typing (FET) assigns entities in text with context-sensitive, fine-grained semantic types.
OntoType follows the type ontology from coarse to fine and ensembles multiple PLM prompting results to generate a set of type candidates.
Our experiments on the OntoNotes, FIGER, and NYT datasets demonstrate that our method outperforms the state-of-the-art zero-shot fine-grained entity typing methods.
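A hedged sketch of the coarse-to-fine idea: walk the type ontology from the root, keeping at each level the child type that scores best. The miniature ontology and the `score` function below are hypothetical stand-ins for the paper's ensembled PLM prompting.

```python
def coarse_to_fine_type(ontology, mention, score, root="entity"):
    """Walk the type ontology from the root, keeping at each level the child
    whose score for the mention is highest. `score(mention, type_name)`
    stands in for ensembled PLM prompting results."""
    path, node = [root], root
    while ontology.get(node):
        node = max(ontology[node], key=lambda t: score(mention, t))
        path.append(node)
    return path

# Hypothetical miniature ontology and a dummy scoring function.
onto = {"entity": ["person", "location"], "person": ["artist", "politician"]}
dummy_score = lambda mention, t: len(set(mention.lower()) & set(t))
print(coarse_to_fine_type(onto, "Barack Obama", dummy_score))
```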
arXiv Detail & Related papers (2023-05-21T00:32:37Z)
- Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
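One plausible rendering of such a structured prompt (an illustrative format, not necessarily the authors' exact template): each demonstration shows a sentence followed by the same sentence with inline word_TAG labels, and the model is asked to continue the pattern for the query.

```python
def structured_prompt(demos, query_words):
    """Few-shot prompt for tagging: each demo pairs a sentence with its
    inline word_TAG labeling; the query ends at the arrow for the model
    to complete in the same structured format."""
    lines = []
    for words, tags in demos:
        tagged = " ".join(f"{w}_{t}" for w, t in zip(words, tags))
        lines.append(f"{' '.join(words)} -> {tagged}")
    lines.append(f"{' '.join(query_words)} ->")
    return "\n".join(lines)

print(structured_prompt([(["dogs", "bark"], ["NOUN", "VERB"])],
                        ["cats", "sleep"]))
# dogs bark -> dogs_NOUN bark_VERB
# cats sleep ->
```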
arXiv Detail & Related papers (2022-11-15T01:13:39Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach that models structures as sequences of actions generated autoregressively by PLMs.
Our approach achieves a new state of the art on all of the structured prediction tasks we consider.
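To illustrate the structures-as-actions idea with a toy encoding of our own (not the paper's actual action inventory), labeled spans can be linearized into bracketing actions that an autoregressive model emits left to right:

```python
def spans_to_actions(words, spans):
    """Linearize labeled spans into an action sequence.
    spans: list of (start, end_exclusive, label), assumed non-overlapping."""
    actions = []
    for i, word in enumerate(words):
        for start, end, label in spans:
            if start == i:
                actions.append(f"[{label}")   # open-span action
        actions.append(word)                  # emit the token itself
        for start, end, label in spans:
            if end == i + 1:
                actions.append("]")           # close-span action
    return actions

words = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
print(" ".join(spans_to_actions(words, spans)))
# [PER Barack Obama ] visited [LOC Paris ]
```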
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Benchmarking Language Models for Code Syntax Understanding [79.11525961219591]
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
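To ground what "syntactic structures of programs" means here, a small self-contained sketch (ours, not the paper's benchmark) that extracts the parent-child AST edges a syntax probe would be asked to recover, using Python's standard ast module:

```python
import ast

def syntax_edges(source):
    """Parse Python source and list (parent, child) node-type edges, the
    kind of ground-truth structure a code-syntax probe targets."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

print(syntax_edges("def f(x):\n    return x + 1"))
# e.g. ('Module', 'FunctionDef'), ('FunctionDef', 'Return'), ('Return', 'BinOp'), ...
```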
arXiv Detail & Related papers (2022-10-26T04:47:18Z)
- Language Model Cascades [72.18809575261498]
Capabilities can be further expanded by repeated test-time interactions with a single model or by composing multiple models.
Cases with control flow and dynamic structure require techniques from probabilistic programming.
We formalize several existing techniques from this perspective, including scratchpads / chain of thought, verifiers, STaR, selection-inference, and tool use.
arXiv Detail & Related papers (2022-07-21T07:35:18Z)
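A minimal sketch of the cascade view under our own simplifying assumptions: a scratchpad/chain-of-thought cascade is two chained samples from a language model, the second conditioned on the first, and a simple voting variant composes several such cascades. `sample_lm` is a stand-in for any stochastic LM sampler, not a specific library API.

```python
def scratchpad_cascade(question, sample_lm):
    """Compose two LM calls: sample a reasoning trace, then an answer
    conditioned on it -- a cascade in the probabilistic-program sense."""
    thought = sample_lm(f"Q: {question}\nLet's think step by step:")
    answer = sample_lm(f"Q: {question}\nReasoning: {thought}\nA:")
    return thought, answer

def vote_over_cascades(question, sample_lm, n=5):
    """A voting variant in the same compositional spirit: run the cascade
    several times and keep the most common answer."""
    answers = [scratchpad_cascade(question, sample_lm)[1] for _ in range(n)]
    return max(set(answers), key=answers.count)
```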