Scaling Structured Inference with Randomization
- URL: http://arxiv.org/abs/2112.03638v1
- Date: Tue, 7 Dec 2021 11:26:41 GMT
- Title: Scaling Structured Inference with Randomization
- Authors: Yao Fu and Mirella Lapata
- Abstract summary: We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
- Score: 64.18063627155128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scale of the state space of discrete graphical models is crucial for
model capacity in the era of deep learning. Existing dynamic programming (DP)
based inference typically works with a small number of states (usually less
than hundreds). In this work, we propose a family of randomized dynamic
programming (RDP) algorithms for scaling structured models to tens of thousands
of latent states. Our method is widely applicable to classical DP-based
inference (partition, marginal, reparameterization, entropy, etc.) and
different graph structures (chains, trees, and more general hypergraphs). It is
also compatible with automatic differentiation, so it can be integrated with
neural networks seamlessly and learned with gradient-based optimizers. Our core
technique is randomization, which restricts and reweights DP on a small selected
subset of nodes, reducing computation by orders of magnitude. We further achieve
low bias and variance with Rao-Blackwellization and importance sampling.
Experiments on different inference tasks over different graph structures
demonstrate the accuracy and efficiency of our methods. Furthermore,
when using RDP to train a scaled structured VAE, it outperforms baselines in
terms of test likelihood and successfully prevents posterior collapse.
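To make the restrict-and-reweight idea concrete, here is a minimal NumPy sketch of a randomized forward algorithm for the partition function of a linear-chain model: each DP step sums over only k uniformly sampled states, rescaled by the inverse inclusion probability, instead of over all S states. This is an illustrative sketch under assumed names and a uniform proposal, not the authors' implementation; the paper additionally uses importance sampling and Rao-Blackwellization to lower bias and variance.

```python
import numpy as np

def exact_partition(unary, pairwise):
    """Exact forward algorithm for a linear-chain model.
    unary: (T, S) nonnegative unary potentials; pairwise: (S, S) transition potentials."""
    alpha = unary[0].copy()
    for t in range(1, unary.shape[0]):
        alpha = (alpha @ pairwise) * unary[t]       # sum over all S previous states
    return alpha.sum()

def randomized_partition(unary, pairwise, k, rng):
    """Restrict-and-reweight sketch: each DP step sums over only k sampled states,
    rescaled by the inverse inclusion probability (S / k), so the estimate of the
    partition function remains unbiased. A uniform proposal is used for simplicity;
    an importance-sampling proposal would lower the variance."""
    T, S = unary.shape
    alpha = unary[0].copy()
    for t in range(1, T):
        idx = rng.choice(S, size=k, replace=False)                  # restrict: keep k of S states
        alpha = ((S / k) * alpha[idx]) @ pairwise[idx] * unary[t]   # reweighted DP update
    return alpha.sum()

rng = np.random.default_rng(0)
unary = rng.random((16, 2000))          # 16 positions, 2000 latent states (illustrative sizes)
pairwise = rng.random((2000, 2000))
exact = exact_partition(unary, pairwise)
estimate = np.mean([randomized_partition(unary, pairwise, k=128, rng=rng) for _ in range(32)])
print(exact, estimate)                  # the unbiased estimate fluctuates around the exact value
```

With k much smaller than S, each restricted update costs O(kS) rather than O(S^2), which is where the orders-of-magnitude reduction in computation comes from in this sketch.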
Related papers
- Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of gradient exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
arXiv Detail & Related papers (2023-06-16T17:59:51Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a minimal sketch of the basic forward-gradient estimator appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Combating Mode Collapse in GANs via Manifold Entropy Estimation [70.06639443446545]
Generative Adversarial Networks (GANs) have shown compelling results in various tasks and applications.
We propose a novel training pipeline to address the mode collapse issue of GANs.
arXiv Detail & Related papers (2022-08-25T12:33:31Z)
- Probabilistic partition of unity networks: clustering based deep approximation [0.0]
Partition of unity networks (POU-Nets) have been shown capable of realizing algebraic convergence rates for regression and solution of PDEs.
We enrich POU-Nets with a Gaussian noise model to obtain a probabilistic generalization amenable to gradient-based minimization of a maximum likelihood loss.
We provide benchmarks quantifying performance in high/low-dimensions, demonstrating that convergence rates depend only on the latent dimension of data within high-dimensional space.
arXiv Detail & Related papers (2021-07-07T08:02:00Z)
- A Distributed Optimisation Framework Combining Natural Gradient with Hessian-Free for Discriminative Sequence Training [16.83036203524611]
This paper presents a novel natural gradient and Hessian-free (NGHF) optimisation framework for neural network training.
It relies on the linear conjugate gradient (CG) algorithm to combine the natural gradient (NG) method with local curvature information from Hessian-free (HF) or other second-order methods.
Experiments are reported on the multi-genre broadcast data set for a range of different acoustic model types.
arXiv Detail & Related papers (2021-03-12T22:18:34Z)
- Message Passing Descent for Efficient Machine Learning [4.416484585765027]
We propose a new iterative optimization method for the Data-Fitting (DF) problem in Machine Learning.
The approach relies on a Graphical Model representation of the DF problem.
We suggest the Message Passing Descent algorithm, which relies on the piece-wise-polynomial representation of the model DF function.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Randomized Automatic Differentiation [22.95414996614006]
We develop a general framework and approach for randomized automatic differentiation (RAD).
RAD allows unbiased gradient estimates to be computed with reduced memory in return for increased variance (a minimal sketch of this trade-off appears after this list).
We show that RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks.
arXiv Detail & Related papers (2020-07-20T19:03:44Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
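For the "Scaling Forward Gradient With Local Losses" entry above, here is a minimal NumPy sketch of the basic forward-gradient estimator it builds on: sample a random direction, compute one directional derivative, and scale the direction by it. This is not that paper's method (which perturbs activations and uses local losses to cut variance); the toy quadratic objective, the analytic directional derivative, and all names are illustrative assumptions.

```python
import numpy as np

def forward_gradient(grad_dot, theta, rng):
    """Weight-space forward gradient: sample a random direction v, take the
    directional derivative of the loss along v (one forward-mode JVP in
    practice; supplied analytically here), and return (grad . v) * v."""
    v = rng.standard_normal(theta.shape)
    return grad_dot(theta, v) * v

# Toy quadratic f(theta) = 0.5 * theta^T A theta, so grad = A theta and the
# directional derivative along v is v^T A theta (what forward-mode AD would give).
rng = np.random.default_rng(0)
A = np.diag(np.arange(1.0, 6.0))
theta = rng.standard_normal(5)
true_grad = A @ theta
estimate = np.mean([forward_gradient(lambda th, v: v @ (A @ th), theta, rng)
                    for _ in range(20000)], axis=0)
# `estimate` approaches `true_grad`, since E[(v . grad) v] = grad for v ~ N(0, I).
# The cited paper lowers the variance of this estimator by perturbing activations
# (fewer random dimensions per sample) and attaching local losses.
```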
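For the "Randomized Automatic Differentiation" entry above, here is a minimal NumPy sketch of the memory-for-variance trade-off it describes, applied to one linear layer: the backward pass stores a randomly sparsified, inverse-probability-scaled copy of the input activation instead of the full tensor, keeping the weight-gradient estimate unbiased. This illustrates the general idea only, not the paper's specific estimators; the layer setup, keep probability, and all names are assumptions.

```python
import numpy as np

def linear_backward_exact(x, g):
    """Exact weight gradient for y = x @ W: requires storing the full input x."""
    return x.T @ g

def linear_backward_rad(x, g, keep_prob, rng):
    """Memory-for-variance sketch: store only a randomly sparsified copy of x,
    scaled by 1/keep_prob. The weight-gradient estimate stays unbiased, and the
    saved tensor (kept in a sparse format) uses roughly keep_prob of the memory."""
    mask = rng.random(x.shape) < keep_prob
    x_saved = np.where(mask, x / keep_prob, 0.0)   # this is what the tape would store
    return x_saved.T @ g

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 64))    # activations entering a linear layer
g = rng.standard_normal((512, 32))    # upstream gradient w.r.t. the layer output
exact = linear_backward_exact(x, g)
approx = np.mean([linear_backward_rad(x, g, keep_prob=0.25, rng=rng)
                  for _ in range(200)], axis=0)
# `approx` matches `exact` in expectation while storing ~25% of the activation entries.
```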