Phase Transitions for the Information Bottleneck in Representation
Learning
- URL: http://arxiv.org/abs/2001.01878v1
- Date: Tue, 7 Jan 2020 03:55:32 GMT
- Title: Phase Transitions for the Information Bottleneck in Representation
Learning
- Authors: Tailin Wu and Ian Fischer
- Abstract summary: In the Information Bottleneck (IB), when tuning the relative strength between compression and prediction terms, how do the two terms behave, and what's their relationship with the dataset and the learned representation?
We introduce a definition for IB phase transitions as a qualitative change of the IB loss landscape, and show that the transitions correspond to the onset of learning new classes.
Using second-order calculus of variations, we derive a formula that provides a practical condition for IB phase transitions, and draw its connection with the Fisher information matrix for parameterized models.
- Score: 14.381429281068565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the Information Bottleneck (IB), when tuning the relative strength between
compression and prediction terms, how do the two terms behave, and what's their
relationship with the dataset and the learned representation? In this paper, we
set out to answer these questions by studying multiple phase transitions in the
IB objective: $\text{IB}_\beta[p(z|x)] = I(X; Z) - \beta I(Y; Z)$ defined on
the encoding distribution $p(z|x)$ for input $X$, target $Y$ and representation
$Z$, where sudden jumps of $dI(Y; Z)/d \beta$ and prediction accuracy are
observed with increasing $\beta$. We introduce a definition for IB phase
transitions as a qualitative change of the IB loss landscape, and show that the
transitions correspond to the onset of learning new classes. Using second-order
calculus of variations, we derive a formula that provides a practical condition
for IB phase transitions, and draw its connection with the Fisher information
matrix for parameterized models. We provide two perspectives to understand the
formula, revealing that each IB phase transition is finding a component of
maximum (nonlinear) correlation between $X$ and $Y$ orthogonal to the learned
representation, in close analogy with canonical-correlation analysis (CCA) in
linear settings. Based on the theory, we present an algorithm for discovering
phase transition points. Finally, we verify that our theory and algorithm
accurately predict phase transitions in categorical datasets, predict the onset
of learning new classes and class difficulty in MNIST, and predict prominent
phase transitions in CIFAR10.
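For concreteness, the objective $\text{IB}_\beta[p(z|x)] = I(X; Z) - \beta I(Y; Z)$ can be evaluated numerically for a discrete encoder. The sketch below is illustrative only (it is not code from the paper): it computes $I(X;Z)$, $I(Y;Z)$, and the IB value under the Markov chain $Y \leftarrow X \rightarrow Z$, which is the setting the abstract assumes.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats from a joint distribution p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal p(a)
    p_b = p_joint.sum(axis=0, keepdims=True)   # marginal p(b)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_x, p_y_given_x, p_z_given_x, beta):
    """IB_beta[p(z|x)] = I(X;Z) - beta * I(Y;Z) for discrete X, Y, Z.

    p_x:         shape (nx,)       -- input distribution
    p_y_given_x: shape (nx, ny)    -- target conditional
    p_z_given_x: shape (nx, nz)    -- candidate encoder
    """
    p_xz = p_x[:, None] * p_z_given_x                 # joint p(x, z)
    # p(y, z) = sum_x p(x) p(y|x) p(z|x), using the Markov chain Y - X - Z
    p_yz = (p_xz[:, None, :] * p_y_given_x[:, :, None]).sum(axis=0)
    i_xz = mutual_information(p_xz)
    i_yz = mutual_information(p_yz)
    return i_xz - beta * i_yz, i_xz, i_yz

# Toy usage: identity encoder on two balanced, perfectly predictive classes,
# so I(X;Z) = I(Y;Z) = log 2 and the objective vanishes at beta = 1.
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[1.0, 0.0], [0.0, 1.0]])
p_z_given_x = np.eye(2)
val, i_xz, i_yz = ib_objective(p_x, p_y_given_x, p_z_given_x, beta=1.0)
```

Sweeping `beta` over a grid and tracking `i_yz` for the optimized encoder at each value is one way to visualize the sudden jumps in $I(Y;Z)$ that the paper identifies as phase transitions; the optimization step itself is beyond this sketch.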
Related papers
- Phase-space entropy at acquisition reflects downstream learnability [54.4100065023873]
We propose an acquisition-level scalar $S_{\mathcal{B}}$ based on instrument-resolved phase space. We show theoretically that $S_{\mathcal{B}}$ correctly identifies the phase-space coherence of periodic sampling. $|S_{\mathcal{B}}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training.
arXiv Detail & Related papers (2025-12-22T10:03:51Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [7.869708570399577]
We consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$.
Theoretically, we show that the trained Transformer reaches near Bayes-optimum, suggesting the usage of the information of the training distribution.
arXiv Detail & Related papers (2024-05-24T00:08:55Z) - Perturbative partial moment matching and gradient-flow adaptive importance sampling transformations for Bayesian leave one out cross-validation [0.9895793818721335]
We motivate the use of perturbative transformations of the form $T(\boldsymbol{\theta}) = \boldsymbol{\theta} + h\,Q(\boldsymbol{\theta})$ for $0 < h \ll 1$. We derive closed-form expressions in the case of logistic regression and shallow ReLU-activated neural networks.
arXiv Detail & Related papers (2024-02-13T01:03:39Z) - Scale-invariant phase transition of disordered bosons in one dimension [0.0]
The disorder-induced quantum phase transition between superfluid and non-superfluid states of bosonic particles in one dimension is generally expected to be of the Berezinskii-Kosterlitz-Thouless (BKT) type.
Here, we show that hard-core lattice bosons with integrable power-law hopping decaying with distance as $1/r^\alpha$ undergo a non-BKT continuous phase transition instead.
arXiv Detail & Related papers (2023-10-26T13:30:12Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative
Models [49.81937966106691]
We develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models.
In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach.
arXiv Detail & Related papers (2023-06-15T16:30:08Z) - Towards understanding neural collapse in supervised contrastive learning with the information bottleneck method [26.874007846077884]
Neural collapse describes the geometry of activations in the final layer of a deep neural network when it is trained beyond performance plateaus.
We demonstrate that neural collapse leads to good generalization specifically when it approaches an optimal IB solution of the classification problem.
arXiv Detail & Related papers (2023-05-19T18:41:17Z) - A sharp phase transition in linear cross-entropy benchmarking [1.4841630983274847]
A key question in the theory of XEB is whether it approximates the fidelity of the quantum state preparation.
Previous works have shown that the XEB generically approximates the fidelity in a regime where the noise rate per qudit $\varepsilon$ satisfies $\varepsilon N \ll 1$.
Here, we show that the breakdown of XEB as a fidelity proxy occurs as a sharp phase transition at a critical value of $\varepsilon N$.
arXiv Detail & Related papers (2023-05-08T18:00:05Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive
Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z) - A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian
Kernel, a Precise Phase Transition, and the Corresponding Double Descent [85.77233010209368]
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples $n$, their dimension $p$, and the number of random features $N$ are all large and comparable.
This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$.
arXiv Detail & Related papers (2020-06-09T02:05:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.