Related papers: Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

URL: http://arxiv.org/abs/2410.08255v1
Date: Thu, 10 Oct 2024 16:23:42 GMT
Title: Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning
Authors: David D. Baek, Yuxiao Li, Max Tegmark
Abstract summary: We investigate how neural networks represent knowledge during graph learning. We find hints of universality, where equivalent representations are learned across a range of model sizes. We show that these attractor representations optimize generalization to unseen examples.
Score: 8.025148264640967
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Motivated by interpretability and reliability, we investigate how neural networks represent knowledge during graph learning. We find hints of universality, where equivalent representations are learned across a range of model sizes (from $10^2$ to $10^9$ parameters) and contexts (MLP toy models, LLM in-context learning and LLM training). We show that these attractor representations optimize generalization to unseen examples by exploiting properties of knowledge graph relations (e.g. symmetry and meta-transitivity). We find experimental support for such universality by showing that LLMs and simpler neural networks can be stitched, i.e., by stitching the first part of one model to the last part of another, mediated only by an affine or almost affine transformation. We hypothesize that this dynamic toward simplicity and generalization is driven by "intelligence from starvation": where overfitting is minimized by pressure to minimize the use of resources that are either scarce or competed for against other tasks.
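
Illustrative sketch (not the authors' code): the abstract describes stitching the first part of one model to the last part of another through an affine map. One simple way such a map could be fitted, assuming paired hidden states from both models are available, is ordinary least squares. The function name fit_affine_stitch, the array shapes, and the synthetic data below are all hypothetical and chosen only for illustration; the toy data is constructed so that an exact affine relation exists.

```python
import numpy as np

def fit_affine_stitch(h_a, h_b):
    """Least-squares affine map (W, b) such that h_a @ W + b approximates h_b."""
    ones = np.ones((h_a.shape[0], 1))
    x = np.hstack([h_a, ones])                  # append a bias column
    sol, *_ = np.linalg.lstsq(x, h_b, rcond=None)
    return sol[:-1], sol[-1]                    # weight matrix, bias vector

# Synthetic stand-ins for hidden states (assumption: 200 paired examples).
rng = np.random.default_rng(0)
h_a = rng.normal(size=(200, 8))                 # hypothetical states from model A
true_w = rng.normal(size=(8, 5))
true_b = rng.normal(size=5)
h_b = h_a @ true_w + true_b                     # toy case: B's states are exactly affine in A's

w, b = fit_affine_stitch(h_a, h_b)
stitched = h_a @ w + b                          # A's states mapped into B's space
max_err = float(np.max(np.abs(stitched - h_b)))
print(w.shape, b.shape)
```

In a real stitching experiment the mapped states would be fed into the remaining layers of the second model and judged by downstream task performance; the reconstruction error here only checks the fit on this toy data.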

Related papers

A Markov Categorical Framework for Language Modeling [5.980612601840882]
Auto-regressive language models factorize sequence probabilities and are trained by minimizing the negative log-likelihood (NLL) objective. This work introduces a unifying analytical framework using Markov Categories (MCs) to deconstruct the AR generation process and the NLL objective. By analyzing the information geometry of the model's prediction head, we show that NLL implicitly forces the learned representation space to align with the eigenspectrum of a similarity predictive operator.
arXiv Detail & Related papers (2025-07-25T13:14:03Z)
Probing Neural Topology of Large Language Models [15.34202977968525]
We introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons. We find a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps.
arXiv Detail & Related papers (2025-06-01T14:57:03Z)
Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks [25.720233631885726]
Integration of Graph Neural Networks (GNNs) and Large Language Models (LLMs) has emerged as a promising technological paradigm. We leverage graph description texts with rich semantic context to fundamentally enhance data quality. This work serves as a foundational reference for researchers and practitioners looking to advance graph learning methodologies.
arXiv Detail & Related papers (2024-12-17T01:41:17Z)
Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples. We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer. We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models [42.17166746027585]
We introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data. Specifically, we propose a $\beta$-VAE based module to extract factors as the initial nodes of the graph. By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement.
arXiv Detail & Related papers (2024-07-26T15:32:21Z)
Learning from Teaching Regularization: Generalizable Correlations Should be Easy to Imitate [40.5601980891318]
Generalization remains a central challenge in machine learning. We propose Learning from Teaching (LoT), a novel regularization technique for deep neural networks to enhance generalization. LoT operationalizes this concept to improve the generalization of the main model with auxiliary student learners.
arXiv Detail & Related papers (2024-02-05T07:05:17Z)
Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners. We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting. Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
Quantifying the Optimization and Generalization Advantages of Graph Neural Networks Over Multilayer Perceptrons [50.33260238739837]
Graph neural networks (GNNs) have demonstrated remarkable capabilities in learning from graph-structured data. There remains a lack of analysis comparing GNNs and multilayer perceptrons (MLPs) from an optimization and generalization perspective.
arXiv Detail & Related papers (2023-06-24T10:21:11Z)
Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features. We find new and interesting properties that do not exist in single-task linear regression. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
Generalization on the Unseen, Logic Reasoning and Degree Curriculum [25.7378861650474]
This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting. We study how different network architectures trained by (S)GD perform under GOTU. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements.
arXiv Detail & Related papers (2023-01-30T17:44:05Z)
Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules. Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks [80.55378250013496]
We study how neural networks trained by gradient descent extrapolate what they learn outside the support of the training distribution. Graph Neural Networks (GNNs) have shown some success in more complex tasks.
arXiv Detail & Related papers (2020-09-24T17:48:59Z)
FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs [53.710405006523274]
This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. We develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
arXiv Detail & Related papers (2020-06-18T19:11:18Z)
Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks. We use pairs of minimally different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task. Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.