Increasing Trust in Language Models through the Reuse of Verified Circuits
- URL: http://arxiv.org/abs/2402.02619v8
- Date: Fri, 12 Jul 2024 00:34:01 GMT
- Title: Increasing Trust in Language Models through the Reuse of Verified Circuits
- Authors: Philip Quirke, Clement Neo, Fazl Barez
- Abstract summary: Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases.
We show that a model can be trained to meet this standard if built using mathematically and logically specified frameworks.
We find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model.
- Score: 1.8434042562191815
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this standard if built using mathematically and logically specified frameworks. In this paper, we fully verify an auto-regressive transformer model for n-digit integer addition. To exhibit the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction. We find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. We discuss how inserting verified task modules into LMs can leverage model reuse to improve verifiability and trustworthiness of language models built using them. The reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models.
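To make the module-reuse step concrete, below is a minimal sketch (not the authors' code) of how a small transformer assumed to be verified on n-digit addition could be wrapped inside a larger, initially untrained model that is then trained on both addition and subtraction. The names VerifiedAdder and CombinedModel, the commented-out checkpoint, the vocabulary, and all dimensions are illustrative assumptions; freezing the reused block is shown as one option for preserving its verified behaviour, not necessarily the authors' choice.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, SEQ_LEN = 14, 64, 24        # e.g. digits 0-9 plus '+', '-', '=', pad

class VerifiedAdder(nn.Module):
    """Stands in for the small transformer assumed verified on n-digit addition."""
    def __init__(self, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, h):
        return self.blocks(h)

class CombinedModel(nn.Module):
    """Larger model wrapping the reused adder with fresh, untrained capacity."""
    def __init__(self, adder: VerifiedAdder):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        self.adder = adder                                   # reused verified circuits
        new_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.extra = nn.TransformerEncoder(new_layer, num_layers=1)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos
        return self.head(self.extra(self.adder(h)))

adder = VerifiedAdder()
# In practice the verified, trained weights would be loaded here, e.g.:
# adder.load_state_dict(torch.load("verified_adder.pt"))    # hypothetical checkpoint
for p in adder.parameters():
    p.requires_grad = False             # one option: freeze the verified module

model = CombinedModel(adder)
optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
logits = model(torch.randint(0, VOCAB, (2, SEQ_LEN)))        # mixed add/sub batches in practice
print(logits.shape)                                          # torch.Size([2, 24, 14])
```

In this sketch only the wrapper's parameters receive gradients, so the inserted addition circuits are reused unchanged; the abstract reports extensive reuse of those circuits in the jointly trained model.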
Related papers
- Towards a unified and verified understanding of group-operation networks [0.8305049591788082]
We investigate the internals of one-hidden-layer neural networks trained on the binary operation of finite groups.
We produce a more complete description of such models in a step towards unifying the explanations of previous works.
arXiv Detail & Related papers (2024-10-09T23:02:00Z)
- Understanding Addition in Transformers [2.07180164747172]
This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition.
Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different positions within the digits.
arXiv Detail & Related papers (2023-10-19T19:34:42Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
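As a rough illustration of the coefficient-learning idea, the self-contained toy below merges task vectors with learnable task-wise coefficients and tunes them by minimizing prediction entropy on unlabeled inputs, one way to avoid the original training data. It is a hypothetical stand-in with random weights, not the AdaMerging implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

torch.manual_seed(0)
model = nn.Linear(8, 4)                                   # toy shared architecture
pretrained = {k: v.detach().clone() for k, v in model.named_parameters()}
# One "task vector" (finetuned - pretrained) per task; random stand-ins here.
task_vectors = [{k: 0.01 * torch.randn_like(v) for k, v in pretrained.items()}
                for _ in range(3)]

lambdas = torch.full((len(task_vectors),), 0.3, requires_grad=True)   # task-wise coefficients
optim = torch.optim.Adam([lambdas], lr=1e-2)

def merged_params(lams):
    """merged = pretrained + sum_k lambda_k * task_vector_k"""
    return {name: base + sum(l * tv[name] for l, tv in zip(lams, task_vectors))
            for name, base in pretrained.items()}

for _ in range(100):                                      # unlabeled batches in practice
    x = torch.randn(32, 8)
    logits = functional_call(model, merged_params(lambdas), (x,))
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    optim.zero_grad()
    entropy.backward()
    optim.step()

print("learned task-wise coefficients:", lambdas.detach())
```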
- Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters across different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z)
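A minimal sketch of the basic sharing mechanism, assuming the simplest variant in which a single transformer layer is reapplied at every depth so the parameter count stays that of one layer (this is not the paper's model, only an illustration):

```python
import torch
import torch.nn as nn

class SharedDepthEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth                       # how many times the one layer is reused

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)                    # identical weights at every depth
        return x

enc = SharedDepthEncoder()
out = enc(torch.randn(2, 10, 64))                # (batch, seq, d_model)
n_params = sum(p.numel() for p in enc.parameters())
print(out.shape, n_params)                       # parameters of a single layer only
```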
- Interpretable models for extrapolation in scientific machine learning [0.0]
Complex machine learning algorithms often outperform simple regressions in interpolative settings.
We examine the trade-off between model performance and interpretability across a broad range of science and engineering problems.
arXiv Detail & Related papers (2022-12-16T19:33:28Z)
- Inter-model Interpretability: Self-supervised Models as a Case Study [0.2578242050187029]
We build on a recent interpretability technique called Dissect to introduce inter-model interpretability.
We project 13 top-performing self-supervised models into a Learned Concepts Embedding space that reveals proximities among models from the perspective of learned concepts.
The experiment allowed us to group the models into three categories and revealed, for the first time, the types of visual concepts that different tasks require.
arXiv Detail & Related papers (2022-07-24T22:50:18Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and attaining high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Model-agnostic multi-objective approach for the evolutionary discovery of mathematical models [55.41644538483948]
In modern data science, it is often more important to understand the properties of a model and which of its parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z)
- Deducing neighborhoods of classes from a fitted model [68.8204255655161]
In this article, a new kind of interpretable machine learning method is presented.
It can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts.
Basically, real data points (or specific points of interest) are used and the changes of the prediction after slightly raising or decreasing specific features are observed.
arXiv Detail & Related papers (2020-09-11T16:35:53Z)
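The perturbation probe described in the entry above can be illustrated with a small hypothetical example, not the article's exact procedure: a classifier is fitted, a real data point is taken, and each feature is nudged up or down by a small quantile step to see whether the predicted class flips. Dataset and model choices here are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

point = X[60].copy()                           # a real data point of interest
base_class = clf.predict(point.reshape(1, -1))[0]

for j in range(X.shape[1]):                    # slightly raise/lower each feature in turn
    step = np.quantile(X[:, j], 0.55) - np.quantile(X[:, j], 0.45)   # small quantile shift
    for direction in (+1, -1):
        shifted = point.copy()
        shifted[j] += direction * step
        new_class = clf.predict(shifted.reshape(1, -1))[0]
        if new_class != base_class:
            print(f"feature {j} shifted {direction:+d}: class {base_class} -> {new_class}")
```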