Scale leads to compositional generalization
- URL: http://arxiv.org/abs/2507.07207v1
- Date: Wed, 09 Jul 2025 18:30:50 GMT
- Title: Scale leads to compositional generalization
- Authors: Florian Redhardt, Yassir Akram, Simon Schug,
- Abstract summary: We show that scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. We uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations.
- Score: 1.287456940851492
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.
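The linear-decoding metric mentioned in the abstract can be illustrated with a small probe experiment. The sketch below is a minimal illustration on synthetic data (the activations, module labels, and sizes are hypothetical stand-ins, not the authors' code or models): one linear probe per task slot is fit on hidden activations, and held-out decoding accuracy is reported.

```python
# Minimal sketch of a linear-decoding probe for task constituents.
# All data here is synthetic; embeddings and labels are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, hidden_dim, n_slots, n_modules = 2000, 128, 3, 8

# Pretend hidden activations encode each task slot's module linearly (plus noise).
slot_embeddings = rng.normal(size=(n_slots, n_modules, hidden_dim))
labels = rng.integers(0, n_modules, size=(n_samples, n_slots))
H = sum(slot_embeddings[s, labels[:, s]] for s in range(n_slots))
H = H + 0.5 * rng.normal(size=H.shape)

# One linear probe per task slot: can the module identity be read out linearly?
for s in range(n_slots):
    H_tr, H_te, y_tr, y_te = train_test_split(H, labels[:, s], random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(H_tr, y_tr)
    print(f"slot {s}: held-out decoding accuracy = {probe.score(H_te, y_te):.2f}")
```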
Related papers
- Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks.
We show that the networks acquire strong, data-dependent features.
Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z)
- Attention as a Hypernetwork [22.087242869138223]
Transformers can generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations. We find that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances.
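The reformulation can be checked numerically. The sketch below (dimensions and random weights are arbitrary stand-ins) computes standard multi-head attention, then recomputes the output in the hypernetwork view, where the vector of per-head attention scores for each query-key pair acts as a latent code mixing fixed per-head value-output maps.

```python
# Minimal sketch: multi-head attention and its hypernetwork reformulation agree.
# Shapes and random weights are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 5, 16, 4, 8

X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(n_heads, d_head, d_model))  # per-head block of the output projection

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# --- standard multi-head attention ---
Q = np.einsum('td,hde->hte', X, W_Q)
K = np.einsum('td,hde->hte', X, W_K)
V = np.einsum('td,hde->hte', X, W_V)
A = softmax(np.einsum('hie,hje->hij', Q, K) / np.sqrt(d_head))   # (head, query, key)
out_standard = np.einsum('hij,hje,hem->im', A, V, W_O)

# --- hypernetwork view ---
# Fixed per-head value-to-output maps; the code A[:, i, j] mixes them per (query, key) pair.
M = np.einsum('hde,hem->hdm', W_V, W_O)                # (head, d_model, d_model)
out_hyper = np.zeros_like(out_standard)
for i in range(T):
    for j in range(T):
        code = A[:, i, j]                              # low-dimensional latent code (one entry per head)
        W_ij = np.einsum('h,hdm->dm', code, M)         # key-query specific linear operation
        out_hyper[i] += X[j] @ W_ij

print(np.allclose(out_standard, out_hyper))            # True
```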
arXiv Detail & Related papers (2024-06-09T15:08:00Z)
- When does compositional structure yield compositional generalization? A kernel theory [0.0]
We present a theory of compositional generalization in kernel models with fixed, compositionally structured representations. We identify novel failure modes in compositional generalization that arise from biases in the training data. This work examines how statistical structure in the training data can affect compositional generalization.
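A toy version of this setting can be sketched in a few lines (my own simplification, assuming concatenated one-hot component codes and an additively compositional target, not the paper's construction): a linear model fit on a subset of component combinations generalizes to held-out combinations as long as training covers each component value.

```python
# Toy sketch: a linear model over fixed, compositionally structured features
# (concatenated one-hot codes) generalizing to unseen combinations.
# The additive target and the random train/test split are illustrative assumptions.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_a, n_b = 6, 6                                           # values per component
f_a, f_b = rng.normal(size=n_a), rng.normal(size=n_b)     # per-component contributions

def featurize(a, b):
    x = np.zeros(n_a + n_b)
    x[a] = 1.0
    x[n_a + b] = 1.0
    return x

combos = list(product(range(n_a), range(n_b)))
order = rng.permutation(len(combos))
train = [combos[i] for i in order[:24]]
test = [combos[i] for i in order[24:]]

X_tr = np.array([featurize(a, b) for a, b in train])
y_tr = np.array([f_a[a] + f_b[b] for a, b in train])      # additively compositional target
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

X_te = np.array([featurize(a, b) for a, b in test])
y_te = np.array([f_a[a] + f_b[b] for a, b in test])
# Held-out error is typically near zero when training covers each component value.
print("held-out RMSE:", np.sqrt(np.mean((X_te @ w - y_te) ** 2)))
```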
arXiv Detail & Related papers (2024-05-26T00:50:11Z)
- How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model [47.617093812158366]
We introduce the Random Hierarchy Model: a family of synthetic tasks inspired by the hierarchical structure of language and images.
We find that deep networks learn the task by developing internal representations invariant to exchanging equivalent groups.
Our results indicate how deep networks overcome the curse of dimensionality by building invariant representations.
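The flavor of such synthetic tasks can be sketched as follows (a loose paraphrase with made-up sizes, not the paper's exact Random Hierarchy Model): each class label expands through randomly chosen production rules over several levels, and synonymic rules make many distinct surface strings map to the same label.

```python
# Loose sketch of a random-hierarchy-style generative task
# (sizes and rule structure are illustrative, not the paper's exact model).
import numpy as np

rng = np.random.default_rng(0)
n_classes, vocab, levels, branching, n_synonyms = 4, 8, 3, 2, 3

# At each level, every symbol has several equivalent ("synonymic") production rules,
# each expanding it into `branching` lower-level symbols.
rules = [
    rng.integers(0, vocab, size=(vocab, n_synonyms, branching))
    for _ in range(levels)
]

def sample(label):
    """Expand a class label into a leaf string by picking one random rule per symbol."""
    symbols = [label]
    for level_rules in rules:
        symbols = [
            int(s) for sym in symbols
            for s in level_rules[sym, rng.integers(n_synonyms)]
        ]
    return symbols

for label in range(n_classes):
    print(label, sample(label))   # the same label yields many equivalent leaf strings
```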
arXiv Detail & Related papers (2023-07-05T09:11:09Z)
- Neural Networks and the Chomsky Hierarchy [27.470857324448136]
We study whether insights from the Chomsky hierarchy can predict the limits of neural network generalization in practice.
We show negative results where even extensive amounts of data and training time never led to any non-trivial generalization.
Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, and only networks augmented with structured memory can successfully generalize on context-free and context-sensitive tasks.
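For concreteness, here is a small sketch of what tasks at different levels of the hierarchy look like (toy examples of my own, not the paper's exact benchmark suite): parity is regular, balanced a^n b^n is context-free, and string duplication requires more than context-free power; length generalization is tested by training on short strings and evaluating on longer ones.

```python
# Illustrative formal-language tasks at different levels of the Chomsky hierarchy
# (my own toy examples, not the paper's benchmark implementation).
import random

def parity_example(n):                       # regular: parity of a bit string
    bits = [random.randint(0, 1) for _ in range(n)]
    return bits, sum(bits) % 2

def anbn_example(n, positive=True):          # context-free: membership in a^n b^n
    s = "a" * n + "b" * (n if positive else n + 1)
    return s, positive

def duplication_example(n):                  # beyond context-free: map w to ww
    w = "".join(random.choice("ab") for _ in range(n))
    return w, w + w

# Length generalization: train on short instances, evaluate on much longer ones.
train_lengths, test_lengths = range(1, 11), range(20, 41)
print(parity_example(5), anbn_example(3), duplication_example(4), sep="\n")
```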
arXiv Detail & Related papers (2022-07-05T15:06:11Z)
- Spectral Complexity-scaled Generalization Bound of Complex-valued Neural Networks [78.64167379726163]
This paper is the first work to prove a generalization bound for complex-valued neural networks.
We conduct experiments by training complex-valued convolutional neural networks on different datasets.
arXiv Detail & Related papers (2021-12-07T03:25:25Z)
- Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
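As a very rough sketch of the routing idea (a toy soft-routing layer of my own, not the actual Neural Interpreters architecture): each input is softly assigned to a small set of "function" modules via learned signature vectors, and module outputs are mixed by the routing weights.

```python
# Toy sketch of soft routing among function modules
# (my own simplification, not the Neural Interpreters architecture).
import numpy as np

rng = np.random.default_rng(0)
d, n_functions, n_tokens = 16, 4, 10

X = rng.normal(size=(n_tokens, d))
signatures = rng.normal(size=(n_functions, d))                 # one "type" vector per function
weights = rng.normal(size=(n_functions, d, d)) / np.sqrt(d)    # each function is a linear map here

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Routing: how strongly each token is assigned to each function.
routing = softmax(X @ signatures.T)                            # (n_tokens, n_functions)

# Each token's output is the routing-weighted mixture of the function outputs.
per_function = np.einsum('td,fde->tfe', X, weights)            # (n_tokens, n_functions, d)
Y = np.einsum('tf,tfe->te', routing, per_function)
print(Y.shape, routing[0].round(2))
```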
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
- Robust Generalization of Quadratic Neural Networks via Function Identification [19.87036824512198]
Generalization bounds from learning theory often assume that the test distribution is close to the training distribution.
We show that for quadratic neural networks, we can identify the function represented by the model even though we cannot identify its parameters.
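The distinction between identifying the function and identifying the parameters can be illustrated with a toy quadratic network (my own example; the paper's setting and conditions are more general): with unit second-layer weights, any orthogonal rotation of the hidden weights changes the parameters but not the computed function, since f(x) = ||Wx||^2 depends on W only through W^T W.

```python
# Toy illustration: two different parameterizations of a quadratic network
# compute the same function (my own example, not the paper's construction).
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 8

W = rng.normal(size=(k, d))
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal matrix
W2 = Q @ W                                     # a different parameter setting

def f(weights, x):
    # Quadratic network with unit second-layer weights: sum_j (w_j . x)^2
    return np.sum((weights @ x) ** 2)

x = rng.normal(size=d)
print(np.allclose(f(W, x), f(W2, x)))          # True: same function...
print(np.allclose(W, W2))                      # False: ...different parameters
```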
arXiv Detail & Related papers (2021-09-22T18:02:00Z)
- Generalization in Multimodal Language Learning from Simulation [20.751952728808153]
We investigate the influence of the underlying training data distribution on generalization in a minimal LSTM-based network trained in a supervised, time-continuous setting.
We find that compositional generalization fails in simple setups but improves with the number of objects and actions, and in particular with a high degree of color overlap between objects.
arXiv Detail & Related papers (2021-08-03T12:55:18Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can make two well-separated classes linearly separable with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
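The claim can be illustrated with a toy experiment (the dataset and sizes are my own choices, not the paper's setup): data that is not linearly separable in input space, such as two concentric circles, typically becomes linearly separable after a wide random ReLU layer with Gaussian weights and uniform biases.

```python
# Toy illustration: random ReLU features can make non-linearly-separable data
# linearly separable (dataset and sizes are illustrative choices).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_circles(n_samples=400, noise=0.03, factor=0.5, random_state=0)

# A linear classifier on the raw 2-D inputs fails (two concentric circles).
raw_acc = LogisticRegression(max_iter=2000).fit(X, y).score(X, y)

# One random ReLU layer: standard Gaussian weights, uniformly distributed biases.
width = 2000
W = rng.normal(size=(X.shape[1], width))
b = rng.uniform(-1.0, 1.0, size=width)
H = np.maximum(X @ W + b, 0.0)

relu_acc = LogisticRegression(max_iter=2000).fit(H, y).score(H, y)
print(f"raw accuracy: {raw_acc:.2f}, random-ReLU-feature accuracy: {relu_acc:.2f}")
```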
arXiv Detail & Related papers (2021-07-31T10:25:26Z)
- Neural Complexity Measures [96.06344259626127]
We propose Neural Complexity (NC), a meta-learning framework for predicting generalization.
Our model learns a scalar complexity measure through interactions with many heterogeneous tasks in a data-driven way.
arXiv Detail & Related papers (2020-08-07T02:12:10Z)
- Neural networks adapting to datasets: learning network size and topology [77.34726150561087]
We introduce a flexible setup allowing a neural network to learn both its size and topology during the course of gradient-based training.
The resulting network has the structure of a graph tailored to the particular learning task and dataset.
arXiv Detail & Related papers (2020-06-22T12:46:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.