Related papers: Generalization on the Unseen, Logic Reasoning and Degree Curriculum

URL: http://arxiv.org/abs/2301.13105v3
Date: Wed, 20 Nov 2024 17:16:01 GMT
Title: Generalization on the Unseen, Logic Reasoning and Degree Curriculum
Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Kevin Rizk
Abstract summary: This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting. We study how different network architectures trained by (S)GD perform under GOTU and provide evidence that, for sparse functions, a min-degree-interpolator is learned on the unseen: an interpolator of the training data that has minimal Fourier mass on the higher-degree basis elements.
Score: 25.7378861650474
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements. These findings lead to two implications: (1) we provide an explanation to the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can be potentially corrected when undesirable.
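The min-degree notion in the abstract can be illustrated with a small sketch. Everything below is a hypothetical toy setup, not taken from the paper: the target is assumed to be the monomial x0*x1 on {-1,+1}^2, the training data is assumed to contain only points with x0 = +1, and the helper names (`fourier_coefficients`, `degree`) are made up for illustration. The sketch shows that a lower-degree function can fit the seen data exactly while disagreeing with the target on the unseen part of the domain.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(f, n):
    """Fourier-Walsh coefficients of f: {-1,+1}^n -> R under the uniform measure."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            # hat f(S) = E_x[ f(x) * prod_{i in S} x_i ]
            total = sum(f(x) * prod(x[i] for i in subset) for x in points)
            coeffs[subset] = total / len(points)
    return coeffs


def degree(f, n):
    """Largest |S| whose Fourier coefficient is nonzero."""
    coeffs = fourier_coefficients(f, n)
    return max((len(s) for s, c in coeffs.items() if abs(c) > 1e-12), default=0)


def target(x):
    # Assumed toy target: a degree-2 monomial.
    return x[0] * x[1]


def candidate(x):
    # A lower-degree function that coincides with the target whenever x[0] == +1.
    return x[1]


n = 2
domain = list(product([-1, 1], repeat=n))
seen = [x for x in domain if x[0] == 1]      # assumed training region
unseen = [x for x in domain if x[0] == -1]   # never observed during training

agrees_on_seen = all(target(x) == candidate(x) for x in seen)
agrees_on_unseen = all(target(x) == candidate(x) for x in unseen)

print("degree(target)    =", degree(target, n))
print("degree(candidate) =", degree(candidate, n))
print("agree on seen:", agrees_on_seen, "| agree on unseen:", agrees_on_unseen)
```

In this toy case both functions interpolate the seen data, so a learner biased toward low Fourier degree would be expected to pick the lower-degree one and therefore fail to recover the target on the unseen points.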

Related papers

TRACE: Learning to Compute on Graphs [15.34239150750753]
We introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation. Second, we introduce function shift learning, a novel objective that decouples the learning problem.
arXiv Detail & Related papers (2025-09-26T05:22:32Z)
Characterising the Inductive Biases of Neural Networks on Boolean Data [0.46180371154032906]
We provide an end-to-end, analytically tractable case study that links a network's inductive prior, its training dynamics including feature learning, and its eventual generalisation. Under a Monte Carlo learning algorithm, our model exhibits predictable training dynamics and the emergence of interpretable features.
arXiv Detail & Related papers (2025-05-29T23:03:33Z)
On the Minimal Degree Bias in Generalization on the Unseen for non-Boolean Functions [19.203590688200777]
We investigate the out-of-domain generalization of random feature (RF) models and Transformers. We first prove that in the generalization on the unseen (GOTU) setting, the convergence takes place to interpolators of minimal degree. We then consider the sparse target regime and explain how this regime relates to the small feature regime.
arXiv Detail & Related papers (2024-06-10T15:14:33Z)
What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks. This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z)
Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning [89.89857766491475]
We propose a complex reasoning schema over knowledge graphs (KGs) built upon large language models (LLMs). We augment arbitrary first-order logical queries via binary tree decomposition to stimulate the reasoning capability of LLMs. Experiments across widely used datasets demonstrate that LACT brings substantial improvements (an average +5.5% MRR score) over advanced methods.
arXiv Detail & Related papers (2024-05-02T18:12:08Z)
RAGFormer: Learning Semantic Attributes and Topological Structure for Fraud Detection [8.050935113945428]
We present a novel framework called Relation-Aware GNN with transFormer (RAGFormer). RAGFormer embeds both semantic and topological features into a target node. The simple yet effective network consists of a semantic encoder, a topology encoder, and an attention fusion module.
arXiv Detail & Related papers (2024-02-27T12:53:15Z)
Counterfactual Intervention Feature Transfer for Visible-Infrared Person Re-identification [69.45543438974963]
We find graph-based methods in the visible-infrared person re-identification task (VI-ReID) suffer from bad generalization because of two issues. The well-trained input features weaken the learning of graph topology, making it not generalized enough during the inference process. We propose a Counterfactual Intervention Feature Transfer (CIFT) method to tackle these problems.
arXiv Detail & Related papers (2022-08-01T16:15:31Z)
Towards Sample-efficient Overparameterized Meta-learning [37.676063120293044]
An overarching goal in machine learning is to build a generalizable model with few samples. This paper aims to demystify overparameterization for meta-learning. We show that learning the optimal representation coincides with the problem of designing a task-aware regularization.
arXiv Detail & Related papers (2022-01-16T21:57:17Z)
Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules. Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning. Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks. We use pairs of minimally different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task. Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.