Related papers: Consistent Accelerated Inference via Confident Adaptive Transformers

Consistent Accelerated Inference via Confident Adaptive Transformers

URL: http://arxiv.org/abs/2104.08803v1
Date: Sun, 18 Apr 2021 10:22:28 GMT
Title: Consistent Accelerated Inference via Confident Adaptive Transformers
Authors: Tal Schuster, Adam Fisch, Tommi Jaakkola, Regina Barzilay
Abstract summary: We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers. We simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. We demonstrate the effectiveness of this approach on four classification and regression tasks.
Score: 29.034390810078172
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs -- Confident Adaptive Transformers -- in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.

Related papers

A Constrained Optimization Perspective of Unrolled Transformers [77.12297732942095]
We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms.<n>We observe constrained transformers achieve stronger to perturbations robustness and maintain higher out-of-distribution generalization.
arXiv Detail & Related papers (2026-01-24T02:12:39Z)
Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium [0.6820746164515952]
We introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium.<n>We instantiate this principle as Equilibrium Transformers, which augment standard transformer layers with an Equilibrium Refinement Module.<n>Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance.
arXiv Detail & Related papers (2025-11-26T20:02:59Z)
Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models [64.92045568376705]
Coherent Contextual Decoding (CCD) is a novel inference framework built upon two core innovations.<n>CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence.<n>Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step.
arXiv Detail & Related papers (2025-11-26T09:49:48Z)
On the Role of Surrogates in Conformal Inference of Individual Causal Effects [0.0]
We introduce underlineSurrogate-assisted underlineConformal underlineInference for underlineEfficient IunderlineNdividual underlineCausal underlineEffects (SCIENCE) SCIENCE is a framework designed to construct more efficient prediction intervals for individual treatment effects (ITEs) It is applied to the phase 3 Moderna COVE COVID-19 vaccine trial.
arXiv Detail & Related papers (2024-12-16T21:36:11Z)
Continual Low-Rank Scaled Dot-product Attention [67.11704350478475]
We introduce a new formulation of the Scaled-product Attention based on the Nystr"om approximation that is suitable for Continual Inference. In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude.
arXiv Detail & Related papers (2024-12-04T11:05:01Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
Fourier Test-time Adaptation with Multi-level Consistency for Robust Classification [10.291631977766672]
We propose a novel approach called Fourier Test-time Adaptation (FTTA) to integrate input and model tuning. FTTA builds a reliable multi-level consistency measurement of paired inputs for achieving self-supervised of prediction. It was extensively validated on three large classification datasets with different modalities and organs.
arXiv Detail & Related papers (2023-06-05T02:29:38Z)
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions. We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters. We show that it achieves approximate valid population coverage and near-optimal efficiency within class. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z)
Transformer Uncertainty Estimation with Hierarchical Stochastic Attention [8.95459272947319]
We propose a novel way to enable transformers to have the capability of uncertainty estimation. This is achieved by learning a hierarchical self-attention that attends to values and a set of learnable centroids. We empirically evaluate our model on two text classification tasks with both in-domain (ID) and out-of-domain (OOD) datasets.
arXiv Detail & Related papers (2021-12-27T16:43:31Z)
Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. A linear-complexity recurrent variant has proven well suited for autoregressive generation. This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning. We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. We find that such pretraining enables FPT to generalize in zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model. Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.