Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for
Improved Generalization
- URL: http://arxiv.org/abs/2006.16205v4
- Date: Tue, 24 Oct 2023 23:44:37 GMT
- Title: Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for
Improved Generalization
- Authors: Sang Michael Xie, Tengyu Ma, Percy Liang
- Abstract summary: We focus on prediction problems with structured outputs subject to output validity constraints.
We propose composed fine-tuning, which fine-tunes a predictor composed with the pre-trained denoiser.
For two-layer ReLU networks, we prove that composed fine-tuning significantly reduces the complexity of the predictor.
- Score: 93.95299500688286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on prediction problems with structured outputs that are subject to
output validity constraints, e.g. pseudocode-to-code translation where the code
must compile. While labeled input-output pairs are expensive to obtain,
"unlabeled" outputs, i.e. outputs without corresponding inputs, are freely
available (e.g. code on GitHub) and provide information about output validity.
We can capture the output structure by pre-training a denoiser to denoise
corrupted versions of unlabeled outputs. We first show that standard
fine-tuning after pre-training destroys some of this structure. We then propose
composed fine-tuning, which fine-tunes a predictor composed with the
pre-trained denoiser, which is frozen to preserve output structure. For
two-layer ReLU networks, we prove that composed fine-tuning significantly
reduces the complexity of the predictor, thus improving generalization.
Empirically, we show that composed fine-tuning improves over standard
fine-tuning on two pseudocode-to-code translation datasets (3% and 6%
relative). The improvement from composed fine-tuning is magnified on
out-of-distribution (OOD) examples (4% and 25% relative).
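The composition described in the abstract can be sketched in a few lines. Below is a minimal numpy illustration, not the authors' implementation: the weights, shapes, and function names are placeholders, and the "pre-trained" denoiser weights are random stand-ins. The point is the wiring: the final prediction is denoiser(predictor(x)), gradients flow through the denoiser, but only the base predictor's weights would be placed in the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Frozen denoiser: stands in for a two-layer ReLU network pre-trained to
# denoise corrupted unlabeled outputs (weights here are random placeholders).
W1_d = rng.normal(size=(8, 16))
W2_d = rng.normal(size=(16, 8))

def denoiser(y):
    # Maps (possibly invalid) raw predictions toward the valid-output manifold.
    return relu(y @ W1_d) @ W2_d

# Trainable base predictor (also a two-layer ReLU network, matching the
# setting analyzed in the paper).
W1_p = 0.1 * rng.normal(size=(5, 16))
W2_p = 0.1 * rng.normal(size=(16, 8))

def predictor(x):
    return relu(x @ W1_p) @ W2_p

def composed_predict(x):
    # Composed fine-tuning trains W1_p and W2_p through the frozen denoiser;
    # W1_d and W2_d never receive updates.
    return denoiser(predictor(x))

x = rng.normal(size=(4, 5))
y_hat = composed_predict(x)
```

In a framework like PyTorch the same effect would come from freezing the denoiser's parameters (e.g. excluding them from the optimizer) while backpropagating through it.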
Related papers
- Bit-flipping Decoder Failure Rate Estimation for (v,w)-regular Codes [84.0257274213152]
We propose a new technique to provide accurate estimates of the DFR of a two-iteration (parallel) bit-flipping decoder.
We validate our results, providing comparisons of the modeled and simulated syndrome weight, the distribution of incorrectly guessed error bits at the end of the first iteration, and the two-iteration Decoding Failure Rate (DFR).
arXiv Detail & Related papers (2024-01-30T11:40:24Z)
- Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders [38.78712921188612]
We propose a unified system that jointly uses generative and predictive decoders across two levels.
Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores.
arXiv Detail & Related papers (2023-05-18T06:10:49Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
With gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
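The "one query per sample" property above can be pictured as caching a single frozen forward pass and training only a small head on the cache. This is a hypothetical sketch, not DecT's actual architecture; `frozen_plm` and the shapes are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_plm(x):
    # Stand-in for one frozen PLM forward pass per sample (assumption).
    return np.tanh(x)

# Query the frozen model once per sample and cache the representations.
features = frozen_plm(rng.normal(size=(32, 16)))

# Lightweight task-specific decoder head: the only trainable parameters.
W_dec = np.zeros((16, 3))

def decode(feats):
    # Training touches only W_dec, which is why tuning takes seconds.
    return feats @ W_dec

logits = decode(features)
```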
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- Few-shot Mining of Naturally Occurring Inputs and Outputs [83.3871936721431]
We mine input-output examples from large corpora using a supervised mining function trained on a small seed set of only 100 examples.
Unlike model-generated data augmentation, our method mines naturally occurring, high-quality input-output pairs that mimic the style of the seed set for multiple tasks.
On SQuAD-style reading comprehension, augmenting the seed set with the mined data results in an improvement of 13 F1 over a BART-large baseline fine-tuned only on the seed set.
arXiv Detail & Related papers (2022-05-09T05:40:52Z)
- Recursive Decoding: A Situated Cognition Approach to Compositional Generation in Grounded Language Understanding [0.0]
We present Recursive Decoding (RD), a novel procedure for training and using seq2seq models.
Rather than generating an entire output sequence in one pass, models are trained to predict one token at a time.
RD yields dramatic improvement on two previously neglected generalization tasks in gSCAN.
arXiv Detail & Related papers (2022-01-27T19:13:42Z)
- Sparse Coding with Multi-Layer Decoders using Variance Regularization [19.8572592390623]
We propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder.
Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold.
We show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations.
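The variance constraint described above can be written as a simple hinge penalty. This is an illustrative sketch under assumed conventions (penalizing per-dimension standard deviation below a threshold), not the paper's exact regularizer.

```python
import numpy as np

def variance_regularizer(codes, threshold=1.0):
    # codes: (batch, latent_dim) array of latent codes.
    # Penalize each latent component whose spread across the batch falls
    # below the threshold; components above it incur zero penalty, which
    # discourages collapsed (constant) codes without touching the decoder.
    per_dim_std = codes.std(axis=0)
    return np.maximum(0.0, threshold - per_dim_std).sum()
```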
arXiv Detail & Related papers (2021-12-16T21:46:23Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
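The compression step can be illustrated as pooling adjacent hidden states between blocks. A minimal sketch, assuming mean pooling with stride 2 (the function name and the divisibility assumption are illustrative, not the paper's exact scheme):

```python
import numpy as np

def pool_hidden_states(h, stride=2):
    # h: (seq_len, d_model); seq_len assumed divisible by stride.
    # Averages each window of `stride` adjacent states, shortening the
    # sequence (and hence attention cost) for subsequent blocks.
    seq_len, d_model = h.shape
    return h.reshape(seq_len // stride, stride, d_model).mean(axis=1)
```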
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
- On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
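The gating idea can be sketched as a per-position multiplicative gate between encoder and decoder. The paper computes the expected L0 penalty via a stochastic relaxation; the simplified version below uses plain sigmoid gates with an expected-openness penalty purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_encoder_outputs(enc_states, gate_logits):
    # enc_states: (seq_len, d_model); gate_logits: (seq_len,) learnable.
    gates = sigmoid(gate_logits)        # per-position probability of staying open
    sparsity_penalty = gates.sum()      # expected number of open gates
    # Positions with near-zero gates are effectively dropped before decoding.
    return enc_states * gates[:, None], sparsity_penalty
```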
arXiv Detail & Related papers (2020-04-24T16:57:52Z)
- Learning the Relation between Code Features and Code Transforms with Structured Prediction [13.62633524166298]
We present the first approach for structurally predicting code transforms at the level of AST nodes using conditional random fields (CRFs).
Our approach first learns offline a probabilistic model that captures how certain code transforms are applied to certain AST nodes, and then uses the learned model to predict transforms for arbitrary new, unseen code snippets.
arXiv Detail & Related papers (2019-07-22T12:42:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.