VERBA: Verbalizing Model Differences Using Large Language Models
- URL: http://arxiv.org/abs/2507.02241v1
- Date: Thu, 03 Jul 2025 02:25:24 GMT
- Title: VERBA: Verbalizing Model Differences Using Large Language Models
- Authors: Shravan Doda, Shashidhar Reddy Javaji, Zining Zhu,
- Abstract summary: We introduce $textbfVERBA$ to generate verbalizations of model differences by sampling from two models.<n>For a pair of decision tree models with up to 5% performance difference but 20-25% behavioral differences, $textbfVERBA$ effectively verbalizes their variations with up to 80% overall accuracy.
- Score: 3.2771631221674333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the current machine learning landscape, we face a "model lake" phenomenon: Given a task, there is a proliferation of trained models with similar performances despite different behavior. For model users attempting to navigate and select from the models, documentation comparing model pairs is helpful. However, for every $N$ models there could be $O(N^2)$ pairwise comparisons, a number prohibitive for the model developers to manually perform pairwise comparisons and prepare documentations. To facilitate fine-grained pairwise comparisons among models, we introduced $\textbf{VERBA}$. Our approach leverages a large language model (LLM) to generate verbalizations of model differences by sampling from the two models. We established a protocol that evaluates the informativeness of the verbalizations via simulation. We also assembled a suite with a diverse set of commonly used machine learning models as a benchmark. For a pair of decision tree models with up to 5% performance difference but 20-25% behavioral differences, $\textbf{VERBA}$ effectively verbalizes their variations with up to 80% overall accuracy. When we included the models' structural information, the verbalization's accuracy further improved to 90%. $\textbf{VERBA}$ opens up new research avenues for improving the transparency and comparability of machine learning models in a post-hoc manner.
Related papers
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z) - Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $textitweak-to-strong search, framing the alignment of a large language model as a test-time greedy search.
In controlled-sentiment generation and summarization, we use tuned and untuned $textttgpt2$s to improve the alignment of large models without additional training.
In a more difficult instruction-following benchmark, we show that reusing off-the-shelf small models can improve the length-controlled win rates of both white-box and black-box large models.
arXiv Detail & Related papers (2024-05-29T16:55:32Z) - Perturbed examples reveal invariances shared by language models [8.04604449335578]
We introduce a novel framework to compare two NLP models.
Via experiments on models from the same and different architecture families, this framework offers insights about how changes in models affect linguistic capabilities.
arXiv Detail & Related papers (2023-11-07T17:48:35Z) - Automatic Model Selection with Large Language Models for Reasoning [33.93807127935167]
Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two distinct reasoning methods.
We introduce a model selection method to combine the best of both worlds by employing a large language model.
Our proposed method demonstrates significant performance improvements across eight reasoning datasets.
arXiv Detail & Related papers (2023-05-23T17:57:59Z) - Contrastive Alignment of Vision to Language Through Parameter-Efficient
Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates ($$7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [47.432215933099016]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.<n>This creates a barrier to fusing knowledge across individual models to yield a better single model.<n>We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - CAMERO: Consistency Regularized Ensemble of Perturbed Language Models
with Weight Sharing [83.63107444454938]
We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models, which can effectively promote the model diversity.
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
arXiv Detail & Related papers (2022-04-13T19:54:51Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z) - ModelDiff: Testing-Based DNN Similarity Comparison for Model Reuse
Detection [9.106864924968251]
ModelDiff is a testing-based approach to deep learning model similarity comparison.
A study on mobile deep learning apps has shown the feasibility of ModelDiff on real-world models.
arXiv Detail & Related papers (2021-06-11T15:16:18Z) - What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z) - Introducing various Semantic Models for Amharic: Experimentation and
Evaluation with multiple Tasks and Datasets [19.855120632909124]
We introduce different semantic models for Amharic.
Models are build using word2Vec embeddings, distributional thesaurus (DT), contextual embeddings, and DT embeddings.
We find that newly trained models perform better than pre-trained multilingual models.
arXiv Detail & Related papers (2020-11-02T17:48:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.