Related papers: How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders

How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders

URL: http://arxiv.org/abs/2511.10094v2
Date: Tue, 18 Nov 2025 07:49:46 GMT
Title: How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders
Authors: Yiming Tang, Abhijeet Sinha, Dianbo Liu,
Abstract summary: We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models.<n>Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple levels.<n>We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models.
Score: 4.707391886736485
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.

Related papers

Aggregate Models, Not Explanations: Improving Feature Importance Estimation [29.82699646128964]
We show that ensembling at the model level provides more accurate variable-importance estimates.<n>We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.
arXiv Detail & Related papers (2026-02-12T09:36:03Z)
Attesting Model Lineage by Consisted Knowledge Evolution with Fine-Tuning Trajectory [11.799433126257375]
The fine-tuning technique in deep learning gives rise to an emerging lineage relationship among models.<n>This lineage provides a promising perspective for addressing security concerns such as unauthorized model redistribution and false claim of model provenance.<n>Existing approaches to model lineage detection primarily rely on static architectural similarities.
arXiv Detail & Related papers (2026-01-16T08:56:13Z)
Are vision language models robust to uncertain inputs? [5.249651874118556]
We show that newer and larger vision language models exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions.<n>For natural images such as ImageNet, this limitation can be overcome without pipeline modifications.<n>We propose a novel mechanism based on caption diversity to reveal a model's internal uncertainty.
arXiv Detail & Related papers (2025-05-17T03:16:49Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.<n>Models may behave unreliably due to poorly explored failure modes.<n> causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
Exploring the Trade-off Between Model Performance and Explanation Plausibility of Text Classifiers Using Human Rationales [3.242050660144211]
Saliency post-hoc explainability methods are important tools for understanding increasingly complex NLP models. We present a methodology for incorporating rationales, which are text annotations explaining human decisions, into text classification models.
arXiv Detail & Related papers (2024-04-03T22:39:33Z)
Surrogate Modeling for Physical Systems with Preserved Properties and Adjustable Tradeoffs [0.0]
We present a model-based and a data-driven strategy to generate surrogate models. The latter generates interpretable surrogate models by fitting artificial relations to a presupposed topological structure. Our framework is compatible with various spatial discretization schemes for distributed parameter models.
arXiv Detail & Related papers (2022-02-02T17:07:02Z)
Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction. We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss. Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models. We propose a VAE architecture in which a part of the latent space is grounded by physics. We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z)
DirectDebug: Automated Testing and Debugging of Feature Models [55.41644538483948]
Variability models (e.g., feature models) are a common way for the representation of variabilities and commonalities of software artifacts. Complex and often large-scale feature models can become faulty, i.e., do not represent the expected variability properties of the underlying software artifact.
arXiv Detail & Related papers (2021-02-11T11:22:20Z)
Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward. We test our methodology on two tasks: generating molecules with user-defined properties and identifying short python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.