Model as Loss: A Self-Consistent Training Paradigm
- URL: http://arxiv.org/abs/2505.21156v1
- Date: Tue, 27 May 2025 13:12:45 GMT
- Title: Model as Loss: A Self-Consistent Training Paradigm
- Authors: Saisamarth Rajesh Phaye, Milos Cernak, Andrew Harper
- Abstract summary: We propose Model as Loss, a novel training paradigm that utilizes the encoder from the same model as a loss function to guide the training. By using the encoder's learned features as a loss function, this framework enforces self-consistency between the clean reference speech and the enhanced model output. Our approach outperforms pre-trained deep feature losses on standard speech enhancement benchmarks.
- Score: 8.694495827728101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional methods for speech enhancement rely on handcrafted loss functions (e.g., time or frequency domain losses) or deep feature losses (e.g., using WavLM or wav2vec), which often fail to capture subtle signal properties essential for optimal performance. To address this, we propose Model as Loss, a novel training paradigm that utilizes the encoder from the same model as a loss function to guide the training. The Model as Loss paradigm leverages the encoder's task-specific feature space, optimizing the decoder to produce output consistent with perceptual and task-relevant characteristics of the clean signal. By using the encoder's learned features as a loss function, this framework enforces self-consistency between the clean reference speech and the enhanced model output. Our approach outperforms pre-trained deep feature losses on standard speech enhancement benchmarks, offering better perceptual quality and robust generalization to both in-domain and out-of-domain datasets.
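To make the paradigm concrete, here is a minimal PyTorch sketch, assuming a `model` object that exposes its `encoder` as an attribute; the feature layer, distance metric, and gradient-stopping choices below are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def model_as_loss(model, noisy, clean):
    """Self-consistency loss: compare the model's OWN encoder features of the
    enhanced output against those of the clean reference (assumed API)."""
    enhanced = model(noisy)                  # full enhancement forward pass
    with torch.no_grad():
        ref_feats = model.encoder(clean)     # target features, no gradient
    out_feats = model.encoder(enhanced)      # gradients flow back to the decoder;
                                             # whether the encoder itself is also
                                             # updated here is a design choice
    return F.l1_loss(out_feats, ref_feats)
```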
Related papers
- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine [16.046905753937384]
This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low-bitrate codecs (i.e., less than 200 bps) with minimal loss of downstream model performance.
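As an illustration of the RVQ component only (codebook shapes and the surrounding training losses are assumptions), residual vector quantization refines a feature vector with a stack of codebooks:

```python
import torch

def rvq_quantize(x, codebooks):
    """Residual VQ: each codebook quantizes the residual left by the previous one.
    x: (N, D) features; codebooks: list of (K, D) tensors (assumed shapes).
    Bits per vector is roughly len(codebooks) * log2(K)."""
    residual, codes = x, []
    quantized = torch.zeros_like(x)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)    # (N, K) pairwise distances
        idx = dists.argmin(dim=1)            # nearest code per vector
        q = cb[idx]
        quantized = quantized + q
        residual = residual - q
        codes.append(idx)
    return quantized, codes                  # commitment/task losses applied outside
```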
arXiv Detail & Related papers (2025-07-17T00:32:07Z)
- You Only Train Once [11.97836331714694]
You Only Train Once (YOTO) limits training to a single run for the loss selection and weighting step. We leverage the differentiability of the composite loss formulation, which is widely used for optimizing multiple empirical losses simultaneously. We show that YOTO consistently outperforms the best grid-search model on unseen test data.
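A toy sketch of the underlying observation, namely that a weighted composite loss is differentiable in its weights, so the weights can be trained in the same run as the model. The softmax parameterization here is generic, not necessarily YOTO's, and YOTO's mechanism for avoiding degenerate weightings is not reproduced:

```python
import torch

# Two hypothetical empirical losses combined with learnable weights.
log_w = torch.zeros(2, requires_grad=True)       # unconstrained weight parameters

def composite_loss(loss_a, loss_b):
    w = torch.softmax(log_w, dim=0)              # convex weights, sum to 1
    return w[0] * loss_a + w[1] * loss_b

opt = torch.optim.Adam([log_w], lr=1e-2)         # updated alongside the model
```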
arXiv Detail & Related papers (2025-06-04T18:04:58Z)
- PAGE: Parametric Generative Explainer for Graph Neural Network [16.350208494261913]
PAGE is capable of providing faithful explanations for any graph neural network without necessitating prior knowledge or internal details.
We introduce an additional discriminator to capture the causality between latent causal features and the model's output.
Compared to existing methods, PAGE operates at the sample scale rather than nodes or edges.
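A compact sketch of the described architecture, under the assumption that the explainer is an autoencoder over sample-level graph embeddings with a discriminator tying the latents to the target GNN's output (all names and shapes here are hypothetical):

```python
import torch
import torch.nn as nn

class Explainer(nn.Module):
    """Hypothetical PAGE-style explainer operating at the sample scale."""
    def __init__(self, d_in, d_lat, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_lat), nn.ReLU())
        self.decoder = nn.Linear(d_lat, d_in)            # reconstructs the embedding
        self.discriminator = nn.Linear(d_lat, n_classes) # latents -> GNN's prediction

    def forward(self, graph_emb):
        z = self.encoder(graph_emb)
        # The discriminator is trained to predict the target model's output
        # from z, encouraging z to capture the causally relevant features.
        return self.decoder(z), self.discriminator(z)
```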
arXiv Detail & Related papers (2024-08-26T06:39:49Z)
- Lai Loss: A Novel Loss for Gradient Control [0.0]
"Lai loss" is a novel loss design that integrates the regularization terms (specifically, gradients) into the traditional loss function.<n>With this loss, we can effectively control the model's smoothness and sensitivity.
arXiv Detail & Related papers (2024-05-13T16:17:57Z)
- High-level Feature Guided Decoding for Semantic Segmentation [54.424062794490254]
We propose to use powerful pre-trained high-level features as guidance (HFG) for the upsampler to produce robust results.
Specifically, the high-level features from the backbone are used to train the class tokens, which are then reused by the upsampler for classification.
To push the upper limit of HFG, we introduce a context augmentation encoder (CAE) that can efficiently and effectively operate on the low-resolution high-level feature.
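A hedged sketch of the class-token idea: tokens are conditioned on the backbone's high-level feature and reused by the upsampler for classification (module names, pooling, and shapes are assumptions):

```python
import torch
import torch.nn as nn

class HFGHead(nn.Module):
    """Hypothetical HFG-style head for semantic segmentation decoding."""
    def __init__(self, d, n_classes):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_classes, d))  # class tokens
        self.proj = nn.Linear(d, d)

    def forward(self, high_feat, upsampled):
        # high_feat: (d,) pooled high-level feature; upsampled: (H*W, d)
        guided = self.tokens + self.proj(high_feat)   # guidance broadcast per class
        return upsampled @ guided.t()                 # (H*W, n_classes) logits
```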
arXiv Detail & Related papers (2023-03-15T14:23:07Z)
- STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information for videos during feature extraction and state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z)
- Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network that can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST.
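One common way to penalize feature redundancy in a bottleneck, shown here as an illustration (the paper's exact penalty may differ), is to suppress off-diagonal correlations between bottleneck units:

```python
import torch

def redundancy_penalty(z):
    """Penalize correlated bottleneck features via the off-diagonal entries
    of the feature correlation matrix. z: (batch, d) bottleneck activations."""
    z = (z - z.mean(0)) / (z.std(0) + 1e-8)   # standardize each feature
    c = (z.t() @ z) / z.shape[0]              # (d, d) correlation matrix
    off_diag = c - torch.diag(torch.diag(c))  # zero out the diagonal
    return off_diag.pow(2).sum()              # added to the reconstruction loss
```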
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
- Rate Distortion Characteristic Modeling for Neural Image Compression [59.25700168404325]
End-to-end optimization capability offers neural image compression (NIC) superior lossy compression performance.
However, distinct models must be trained to reach different points in the rate-distortion (R-D) space.
We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep network and statistical modeling.
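As a hedged illustration of the statistical-modeling side, one can fit a simple parametric R-D function to measured (rate, distortion) pairs; the functional form and sample values below are assumptions for demonstration, not the paper's model:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed parametric family: D(R) = a * exp(b * R) + c.
def rd_model(rate, a, b, c):
    return a * np.exp(b * rate) + c

rates = np.array([0.2, 0.4, 0.8, 1.6])          # bpp (hypothetical measurements)
dists = np.array([0.050, 0.030, 0.015, 0.006])  # MSE at each rate (hypothetical)
(a, b, c), _ = curve_fit(rd_model, rates, dists, p0=(0.1, -1.0, 0.0))
# The fitted curve then predicts distortion at unseen rates.
```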
arXiv Detail & Related papers (2021-06-24T12:23:05Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour (a trained VAE's outputs need not re-encode to the same representation) on the learned representations, and also the consequences of fixing it by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
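A minimal sketch of one way to impose such self-consistency, assuming a VAE with `encode`/`decode` methods returning Gaussian parameters (the paper's exact objective is not reproduced here):

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(vae, x):
    """Re-encode the VAE's own decoded sample and pull the two latent
    codes together (assumed encode/decode API)."""
    mu, logvar = vae.encode(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
    x_hat = vae.decode(z)
    mu2, _ = vae.encode(x_hat)                 # latent of the reconstruction
    return F.mse_loss(mu2, mu.detach())        # match the original latent
```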
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
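A toy causal encoder-decoder in this spirit; layer width, depth, and the single residual skip are assumptions, and the actual model is much deeper with per-layer skip connections:

```python
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    """Minimal causal waveform enhancer: left-padded strided conv encoder,
    transposed-conv decoder, residual skip from input to output."""
    def __init__(self, ch=32, k=8, s=4):
        super().__init__()
        self.pad = k - s                              # left padding => causal
        self.enc = nn.Conv1d(1, ch, k, stride=s)
        self.dec = nn.ConvTranspose1d(ch, 1, k, stride=s)

    def forward(self, wav):                           # wav: (batch, 1, time)
        x = nn.functional.pad(wav, (self.pad, 0))     # no access to future samples
        h = torch.relu(self.enc(x))
        out = self.dec(h)[..., :wav.shape[-1]]        # trim to input length
        return out + wav                              # skip connection
```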
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- Improved Natural Language Generation via Loss Truncation [29.676561106319173]
We show that distinguishability serves as a principled and robust alternative for handling invalid references.
We propose loss truncation, which adaptively removes high loss examples during training.
We show this is as easy to optimize as log loss and tightly bounds distinguishability under noise.
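The core procedure is simple to sketch; the drop fraction and selection rule below are illustrative defaults, not necessarily the paper's:

```python
import torch

def truncated_loss(per_example_loss, drop_frac=0.1):
    """Loss truncation: drop the highest-loss fraction of the batch so that
    likely-invalid references do not dominate training.
    per_example_loss: 1-D tensor of unreduced losses."""
    k = int(per_example_loss.numel() * (1 - drop_frac))
    kept, _ = torch.topk(per_example_loss, k, largest=False)  # keep easy examples
    return kept.mean()
```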
arXiv Detail & Related papers (2020-04-30T05:31:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.