Related papers: Reducing Model Jitter: Stable Re-training of Semantic Parsers in Production Environments

URL: http://arxiv.org/abs/2204.04735v1
Date: Sun, 10 Apr 2022 17:57:55 GMT
Title: Reducing Model Jitter: Stable Re-training of Semantic Parsers in Production Environments
Authors: Christopher Hidey, Fei Liu, Rahul Goel
Abstract summary: Retraining modern deep learning systems can lead to variations in model performance even when trained using the same data and hyper-parameters. We demonstrate the effectiveness of various jitter reduction techniques such as ensembling and distillation. We show that co-distillation provides a sweet spot in terms of jitter reduction for semantic parsing systems with only a modest increase in resource usage.
Score: 14.829119556960066
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retraining modern deep learning systems can lead to variations in model performance even when trained using the same data and hyper-parameters by simply using different random seeds. We call this phenomenon model jitter. This issue is often exacerbated in production settings, where models are retrained on noisy data. In this work we tackle the problem of stable retraining with a focus on conversational semantic parsers. We first quantify the model jitter problem by introducing the model agreement metric and showing the variation with dataset noise and model sizes. We then demonstrate the effectiveness of various jitter reduction techniques such as ensembling and distillation. Lastly, we discuss practical trade-offs between such techniques and show that co-distillation provides a sweet spot in terms of jitter reduction for semantic parsing systems with only a modest increase in resource usage.
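The abstract names a model agreement metric but this page does not give its definition. The sketch below is one plausible reading, offered as an assumption rather than the authors' formula: agreement is the fraction of inputs on which two independently retrained models produce the identical parse, averaged over all pairs of runs. The function name `pairwise_agreement` and the toy parse strings are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence


def pairwise_agreement(runs: Sequence[Sequence[Hashable]]) -> float:
    """Mean exact-match rate between every pair of runs.

    Each element of `runs` holds one retrained model's predictions on the
    same evaluation set, in the same order. This definition is an
    assumption for illustration, not taken from the paper.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure agreement")
    n = len(runs[0])
    if n == 0 or any(len(r) != n for r in runs):
        raise ValueError("all runs must cover the same non-empty evaluation set")

    scores = []
    for a, b in combinations(runs, 2):
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / n)
    return sum(scores) / len(scores)


# Toy predictions from three hypothetical retraining runs (different seeds).
runs = [
    ["play(song)", "set_alarm(7am)", "call(mom)", "weather(today)"],
    ["play(song)", "set_alarm(7am)", "call(mom)", "weather(tomorrow)"],
    ["play(song)", "set_timer(7am)", "call(mom)", "weather(today)"],
]
agreement = pairwise_agreement(runs)
print(f"agreement = {agreement:.3f}")
```

Under this reading, lower agreement between runs that differ only in random seed corresponds to more jitter, even when each run's accuracy against the gold labels is similar.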

Related papers

Diffusion models under low-noise regime [3.729242965449096]
We show that diffusion models are effective denoisers when the corruption level is small. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories. This work starts to address gaps in our understanding of generative model reliability in practical applications.
arXiv Detail & Related papers (2025-06-09T15:07:16Z)
Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs [0.0]
Diffusion probabilistic models are able to generate synthetic sensor signals. The training process is controlled by a loss function which measures the difference between the noise that was added in the forward process and the noise that was predicted by the diffusion model. We examine multiple similarity metrics and adapt an existing metric to overcome this issue by monitoring the training and synthetisation process.
arXiv Detail & Related papers (2025-05-20T06:38:17Z)
Joint Diffusion models in Continual Learning [4.013156524547073]
We introduce JDCL - a new method for continual learning with generative rehearsal based on joint diffusion models. Generative-replay-based continual learning methods try to mitigate this issue by retraining a model with a combination of new and rehearsal data sampled from a generative model. We show that such shared parametrization, combined with the knowledge distillation technique allows for stable adaptation to new tasks without catastrophic forgetting.
arXiv Detail & Related papers (2024-11-12T22:35:44Z)
Self-calibration for Language Model Quantization and Pruning [38.00221764773372]
Quantization and pruning methods require calibration data, a small set of unlabeled examples. In this paper, we propose self-calibration as a solution. Our approach requires no external data, instead leveraging the model itself to generate synthetic calibration data.
arXiv Detail & Related papers (2024-10-22T16:50:00Z)
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction. SMILE allows for the upscaling of source models into an MoE model without extra data or further training. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference. OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters. Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z)
ExposureDiffusion: Learning to Expose for Low-light Image Enhancement [87.08496758469835]
This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. The proposed framework can work with real-paired datasets, SOTA noise models, and different backbone networks.
arXiv Detail & Related papers (2023-07-15T04:48:35Z)
BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images. Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few. We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
Empowering Diffusion Models on the Embedding Space for Text Generation [38.664533078347304]
We study the optimization challenges encountered with both the embedding space and the denoising model. Data distribution is learnable for embeddings, which may lead to the collapse of the embedding space and unstable training. Based on the above analysis, we propose Difformer, an embedding diffusion model based on Transformer.
arXiv Detail & Related papers (2022-12-19T12:44:25Z)
Robust Training under Label Noise by Over-parameterization [41.03008228953627]
We propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted. The main idea is very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data. Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets.
arXiv Detail & Related papers (2022-02-28T18:50:10Z)
Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn. We then show that distillation performs strongly for low churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
Generative Text Modeling through Short Run Inference [47.73892773331617]
The present work proposes a short run dynamics for inference. It is initialized from the prior distribution of the latent variable and then runs a small number of Langevin dynamics steps guided by its posterior distribution. We show that the models trained with short run dynamics more accurately model the data, compared to strong language model and VAE baselines, and exhibit no sign of posterior collapse.
arXiv Detail & Related papers (2021-05-27T09:14:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.