Related papers: Model Equality Testing: Which Model Is This API Serving?

URL: http://arxiv.org/abs/2410.20247v1
Date: Sat, 26 Oct 2024 18:34:53 GMT
Title: Model Equality Testing: Which Model Is This API Serving?
Authors: Irena Gao, Percy Liang, Carlos Guestrin
Abstract summary: We formalize detecting distortions of an API-served model's output distribution as Model Equality Testing, a two-sample testing problem. A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions. We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
Score: 59.005869726179455
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- often without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
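The two-sample setup described in the abstract can be illustrated with a minimal sketch. It assumes a positional-match (Hamming-style) string kernel, a biased MMD estimate, and a permutation test for a single prompt; the function names, the 20-character truncation, and the permutation count are illustrative choices, not details taken from the paper.

```python
import random


def hamming_kernel(a: str, b: str, length: int = 20) -> float:
    """Fraction of the first `length` positions at which two strings agree.

    Shorter strings are padded with a null character so every comparison
    covers the same number of positions (an assumption of this sketch).
    """
    a = a[:length].ljust(length, "\0")
    b = b[:length].ljust(length, "\0")
    return sum(x == y for x, y in zip(a, b)) / length


def mmd2(xs, ys, kernel=hamming_kernel) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)


def permutation_test(xs, ys, n_perm: int = 500, seed: int = 0):
    """Return (observed MMD^2, permutation p-value) for H0: same distribution."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_perm):
        # Under H0 the labels are exchangeable, so reshuffle and recompute.
        rng.shuffle(pooled)
        if mmd2(pooled[: len(xs)], pooled[len(xs):]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy completions standing in for API samples and reference samples.
    api_samples = ["The cat sat."] * 10
    ref_samples = ["A dog ran off."] * 10
    stat, p_value = permutation_test(api_samples, ref_samples)
    print(f"MMD^2 = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the API and the reference serve different distributions for that prompt. A real audit would aggregate over many prompts, which this sketch does not attempt.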

Related papers

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering [51.7496756448709]
Language models (LMs) perform well on coding benchmarks but struggle with real-world software engineering tasks. Existing approaches rely on supervised fine-tuning with high-quality data, which is expensive to curate at scale. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process.
arXiv Detail & Related papers (2025-05-29T16:15:36Z)
Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
Adapted-MoE: Mixture of Experts with Test-Time Adaption for Anomaly Detection [10.12283550685127]
We propose an Adapted-MoE to handle multiple distributions of same-category samples by divide and conquer. Specifically, we propose a routing network based on representation learning to route same-category samples into subclass feature spaces. We also propose test-time adaption to eliminate the bias between the unseen test sample representation and the feature distribution learned by the expert model.
arXiv Detail & Related papers (2024-09-09T13:49:09Z)
Perturb-and-Compare Approach for Detecting Out-of-Distribution Samples in Constrained Access Environments [20.554546406575]
We propose an OOD detection framework, MixDiff, that is applicable even when the model's parameters or its activations are not accessible to the end user. We provide theoretical analysis to illustrate MixDiff's effectiveness in discerning OOD samples that induce overconfident outputs from the model.
arXiv Detail & Related papers (2024-08-19T15:51:31Z)
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [81.34900892130929]
We explore inference compute as another axis for scaling by increasing the number of generated samples. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. We find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers.
arXiv Detail & Related papers (2024-07-31T17:57:25Z)
DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting [14.390842560217743]
We propose a novel approach called DistPred for regression and forecasting tasks. We transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form. This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable.
arXiv Detail & Related papers (2024-06-17T10:33:00Z)
Active Sequential Two-Sample Testing [18.99517340397671]
We consider the two-sample testing problem in a new scenario where sample measurements are inexpensive to access. We devise the first active sequential two-sample testing framework that queries not only sequentially but also actively. In practice, we introduce an instantiation of our framework and evaluate it using several experiments.
arXiv Detail & Related papers (2023-01-30T02:23:49Z)
Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs). Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models. Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
Predicting Out-of-Distribution Error with the Projection Norm [87.61489137914693]
Projection Norm predicts a model's performance on out-of-distribution data without access to ground truth labels. We find that Projection Norm is the only approach that achieves non-trivial detection performance on adversarial examples.
arXiv Detail & Related papers (2022-02-11T18:58:21Z)
SITA: Single Image Test-time Adaptation [48.789568233682296]
In Test-time Adaptation (TTA), given a model trained on some source data, the goal is to adapt it to make better predictions for test instances from a different distribution. We consider TTA in a more pragmatic setting which we refer to as SITA (Single Image Test-time Adaptation). Here, when making each prediction, the model has access only to the given single test instance, rather than a batch of instances. We propose a novel approach AugBN for the SITA setting that requires only forward propagation.
arXiv Detail & Related papers (2021-12-04T15:01:35Z)
Significance tests of feature relevance for a blackbox learner [6.72450543613463]
We derive two consistent tests for the feature relevance of a blackbox learner. The first evaluates a loss difference with perturbation on an inference sample. The second splits the inference sample into two but does not require data perturbation.
arXiv Detail & Related papers (2021-03-02T00:59:19Z)
Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data. Our test requires essentially no assumptions on the distributions. By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.