The Optimal Choice of Hypothesis Is the Weakest, Not the Shortest
- URL: http://arxiv.org/abs/2301.12987v4
- Date: Thu, 11 Apr 2024 05:02:10 GMT
- Title: The Optimal Choice of Hypothesis Is the Weakest, Not the Shortest
- Authors: Michael Timothy Bennett
- Abstract summary: One strategy is to choose the shortest, equating the ability to compress information with the ability to generalise.
We show that compression is neither necessary nor sufficient to maximise performance.
We argue this demonstrates that weakness is a far better proxy, and explains why DeepMind's Apperception Engine is able to generalise effectively.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: If $A$ and $B$ are sets such that $A \subset B$, generalisation may be understood as the inference from $A$ of a hypothesis sufficient to construct $B$. One might infer any number of hypotheses from $A$, yet only some of those may generalise to $B$. How can one know which are likely to generalise? One strategy is to choose the shortest, equating the ability to compress information with the ability to generalise (a proxy for intelligence). We examine this in the context of a mathematical formalism of enactive cognition. We show that compression is neither necessary nor sufficient to maximise performance (measured in terms of the probability of a hypothesis generalising). We formulate a proxy unrelated to length or simplicity, called weakness. We show that if tasks are uniformly distributed, then there is no choice of proxy that performs at least as well as weakness maximisation in all tasks while performing strictly better in at least one. In experiments comparing maximum weakness and minimum description length in the context of binary arithmetic, the former generalised at between $1.1$ and $5$ times the rate of the latter. We argue this demonstrates that weakness is a far better proxy, and explains why DeepMind's Apperception Engine is able to generalise effectively.
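As a rough intuition for the abstract's claim, the toy sketch below models hypotheses and hidden tasks as subsets of a small universe of situations. Everything in it (the eight-situation universe, the observations, and the use of the most specific consistent hypothesis as a loose stand-in for minimum description length) is an illustrative assumption, not the paper's formalism or its binary-arithmetic experiments. A hypothesis is consistent if it covers every situation observed to be required and none observed to be forbidden; its weakness is the size of its extension; it generalises to a hidden task $B$ when it covers all of $B$.

```python
import itertools
import random

random.seed(0)
UNIVERSE = list(range(8))   # eight possible situations (illustrative)
positives = {0, 3}          # situations observed to be required
negatives = {5}             # situations observed to be forbidden

def consistent(h):
    """A hypothesis must cover every required situation and no forbidden one."""
    return positives <= h and not (h & negatives)

# Enumerate every hypothesis (every subset of the universe) consistent with
# the observations.
candidates = [set(c)
              for r in range(len(UNIVERSE) + 1)
              for c in itertools.combinations(UNIVERSE, r)
              if consistent(set(c))]

# Weakness maximisation: prefer the hypothesis with the largest extension.
weakest = max(candidates, key=len)
# A loose stand-in for minimum description length: prefer the most specific
# (smallest) consistent hypothesis.
shortest = min(candidates, key=len)

def generalisation_rate(h, trials=10_000):
    """Estimate P(B <= h) for hidden tasks B drawn uniformly from the
    hypotheses consistent with the same observations."""
    return sum(random.choice(candidates) <= h for _ in range(trials)) / trials

print("weakest :", sorted(weakest), generalisation_rate(weakest))
print("shortest:", sorted(shortest), generalisation_rate(shortest))
```

Under a uniform distribution over consistent tasks, the weakest hypothesis covers all of them while the most specific covers only a small fraction, which is the toy analogue of the paper's result that no proxy dominates weakness maximisation.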
Related papers
- Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss [33.18537822803389]
We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, $\mathscr{F}$ is a weakly sub-Gaussian class.
Our result holds whether the problem is realizable or not, and we refer to this as a \emph{near mixing-free} rate, since direct dependence on mixing is relegated to an additive higher-order term.
arXiv Detail & Related papers (2024-02-08T18:57:42Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - On Computing Probabilistic Explanations for Decision Trees [4.406418914680962]
"sufficient reasons" are a kind of explanation in which given a decision tree $T$ and an instance $x$, one explains the decision $T(x)$.
Our paper settles the computational complexity of $\delta$-sufficient-reasons over decision trees.
We identify structural restrictions of decision trees that make the problem tractable, and show how SAT solvers might be able to tackle these problems in practical settings.
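For intuition, a sufficient reason can be found by brute force on tiny trees: a feature subset $S$ is sufficient for $x$ when every completion of $x$'s values outside $S$ yields the same decision. The sketch below is a toy illustration with a hypothetical hard-coded tree, covering only the exact ($\delta = 1$) case rather than the probabilistic $\delta$-sufficient-reasons the paper analyses:

```python
from itertools import combinations, product

N = 3  # number of binary features (illustrative)

def tree(x):
    """A hypothetical hard-coded decision tree over three binary features."""
    if x[0] == 1:
        return 1 if x[1] == 1 else 0
    return 1 if x[2] == 1 else 0

def is_sufficient(x, S):
    """True iff fixing x's values on S forces the decision tree(x)."""
    free = [i for i in range(N) if i not in S]
    for bits in product([0, 1], repeat=len(free)):
        y = list(x)
        for i, b in zip(free, bits):
            y[i] = b
        if tree(y) != tree(x):
            return False
    return True

def minimal_sufficient_reason(x):
    """Smallest feature subset that already determines the decision on x."""
    for size in range(N + 1):
        for S in combinations(range(N), size):
            if is_sufficient(x, set(S)):
                return set(S)

x = (1, 1, 0)
print(tree(x), minimal_sufficient_reason(x))   # decision 1, reason {0, 1}
```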
arXiv Detail & Related papers (2022-06-30T21:58:31Z) - There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes [64.05903267230467]
Interpretability is an essential building block for trustworthiness in reinforcement learning systems.
We show that in certain cases, one can achieve policy interpretability while maintaining its optimality.
arXiv Detail & Related papers (2022-06-09T04:23:26Z) - A Post-Quantum Associative Memory [5.2178708158547025]
Associative memories are devices storing information that can be fully retrieved given partial disclosure of it.
We examine a toy model of associative memory and the ultimate limitations it is subjected to within the framework of general probabilistic theories.
arXiv Detail & Related papers (2022-01-28T18:10:19Z) - Under-bagging Nearest Neighbors for Imbalanced Classification [63.026765294759876]
We propose an ensemble learning algorithm called \textit{under-bagging} $k$-NN for imbalanced classification problems.
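A sketch of the general under-bagging recipe (as commonly described; the paper's exact procedure and theory may differ): each bag down-samples the majority class to the size of the minority class, fits a $k$-NN classifier, and the ensemble averages the bags' predicted probabilities.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def under_bagging_knn(X, y, X_test, n_bags=25, k=5, seed=0):
    """Average k-NN probabilities over bags in which the majority class is
    down-sampled to the size of the minority class."""
    rng = np.random.default_rng(seed)
    minority = min(set(y), key=lambda c: np.sum(y == c))
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    votes = np.zeros(len(X_test))
    for _ in range(n_bags):
        sub = rng.choice(idx_maj, size=len(idx_min), replace=False)
        idx = np.concatenate([idx_min, sub])
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        votes += clf.predict_proba(X_test)[:, 1]   # P(class 1) in this bag
    return (votes / n_bags > 0.5).astype(int)

# Tiny imbalanced example: 200 majority (class 0) vs 20 minority (class 1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
print(under_bagging_knn(X, y, np.array([[2.0, 2.0], [0.0, 0.0]])))  # [1 0]
```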
arXiv Detail & Related papers (2021-09-01T14:10:38Z) - Agnostic learning with unknown utilities [70.14742836006042]
In many real-world problems, the utility of a decision depends on the underlying context $x$ and decision $y$.
We study this as agnostic learning with unknown utilities.
We show that estimating the utilities of only the sampled points $S$ suffices to learn a decision function which generalizes well.
arXiv Detail & Related papers (2021-04-17T08:22:04Z) - Learning Near Optimal Policies with Low Inherent Bellman Error [115.16037976819331]
We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only \emph{batch} assumptions with an algorithm that achieves the optimal statistical rate for the setting we consider.
arXiv Detail & Related papers (2020-02-29T02:02:40Z) - Boosting Simple Learners [45.09968166110557]
We focus on two main questions: (i) Complexity: How many weak hypotheses are needed to produce an accurate hypothesis? (ii) Expressivity: Which tasks can be learned by boosting weak hypotheses?
We design a novel boosting algorithm which circumvents a classical lower bound by Freund and Schapire ('95, '12).
We provide an affirmative answer to the second question for well-studied concept classes, including half-spaces and decision stumps.
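For a concrete feel of boosting a simple base class, the sketch below runs off-the-shelf AdaBoost, whose default base learner is a decision stump, and shows test accuracy improving as weak hypotheses are added. This is vanilla AdaBoost for illustration only, not the paper's new algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost's default base learner is a depth-1 decision tree, i.e. a stump.
for n in (1, 10, 100):   # number of weak hypotheses in the ensemble
    clf = AdaBoostClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    print(n, round(clf.score(X_te, y_te), 3))
```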
arXiv Detail & Related papers (2020-01-31T08:34:56Z) - Does generalization performance of $l^q$ regularization learning depend on $q$? A negative example [19.945160684285003]
$l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling.
We show that all $l^q$ estimators for $0 < q < \infty$ attain similar generalization error bounds.
This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact in terms of the generalization capability.
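This negative result is easy to probe empirically. The toy sketch below (an illustration, not the paper's analysis) fits $l^q$-regularised least squares for several $q$ by minimising $\|Xw - y\|^2 + \lambda \sum_i |w_i|^q$ on a small linear problem and compares held-out error, which typically comes out similar across $q$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 80, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.0, 0.0, 0.5])   # a sparse-ish ground truth
y = X @ w_true + 0.3 * rng.normal(size=n)
X_tr, y_tr, X_te, y_te = X[:60], y[:60], X[60:], y[60:]

def fit(q, lam=0.5):
    """Minimise ||Xw - y||^2 + lam * sum_i |w_i|^q (derivative-free, since
    the penalty is non-smooth at 0 for q <= 1)."""
    obj = lambda w: np.sum((X_tr @ w - y_tr) ** 2) + lam * np.sum(np.abs(w) ** q)
    return minimize(obj, np.zeros(d), method="Powell").x

for q in (0.5, 1.0, 2.0):
    w = fit(q)
    print(q, round(float(np.mean((X_te @ w - y_te) ** 2)), 4))
```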
arXiv Detail & Related papers (2013-07-25T00:48:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.