AB/BA analysis: A framework for estimating keyword spotting recall
improvement while maintaining audio privacy
- URL: http://arxiv.org/abs/2204.08474v1
- Date: Mon, 18 Apr 2022 13:52:22 GMT
- Title: AB/BA analysis: A framework for estimating keyword spotting recall
improvement while maintaining audio privacy
- Authors: Raphael Petegrosso, Vasistakrishna Baderdinni, Thibaud Senechal,
Benjamin L. Bullough
- Abstract summary: KWS systems are designed to collect data only when the keyword is present, limiting the availability of hard samples that may contain false negatives.
We propose an evaluation technique which we call AB/BA analysis.
We show that AB/BA analysis is successful at measuring recall improvement in conjunction with the trade-off in relative false positive rate.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation of keyword spotting (KWS) systems that detect keywords in speech
is a challenging task under realistic privacy constraints. A KWS system is
designed to collect data only when the keyword is present, limiting the availability of
hard samples that may contain false negatives, and preventing direct estimation
of model recall from production data. Alternatively, complementary data
collected from other sources may not be fully representative of the real
application. In this work, we propose an evaluation technique which we call
AB/BA analysis. Our framework evaluates a candidate KWS model B against a
baseline model A, using cross-dataset offline decoding for relative recall
estimation, without requiring negative examples. Moreover, we propose a
formulation with assumptions that allow estimation of relative false positive
rate between models with low variance even when the number of false positives
is small. Finally, we propose to leverage machine-generated soft labels, in a
technique we call Semi-Supervised AB/BA analysis, that improves the analysis
time, privacy, and cost. Experiments with both simulation and real data show
that AB/BA analysis is successful at measuring recall improvement in
conjunction with the trade-off in relative false positive rate.
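As a rough illustration of the cross-dataset decoding idea, here is a minimal sketch. It assumes, beyond what the abstract states, that each model's production-collected samples are re-decoded offline by the other model, that true-keyword samples are identified by human labels or by machine-generated soft labels, and that relative recall can be approximated by the ratio of the two cross-acceptance rates. The names `Sample`, `acceptance_rate`, and `relative_recall` are hypothetical, and the ratio estimator is an assumption rather than the paper's exact formulation.

```python
# Minimal sketch of a cross-decoding, AB/BA-style relative recall estimate.
# Assumptions (not taken verbatim from the paper): every collected sample
# carries a keyword probability (1.0/0.0 for human labels, or a machine
# soft label), and traffic seen by models A and B is statistically similar,
# so the ratio of cross-acceptance rates approximates recall_B / recall_A.

from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    accepted_by_other: bool   # did the *other* model accept this clip offline?
    keyword_prob: float       # 1.0/0.0 for human labels, or a soft label in [0, 1]


def acceptance_rate(samples: Sequence[Sample]) -> float:
    """Soft-label-weighted fraction of true-keyword samples the other model accepts."""
    weight = sum(s.keyword_prob for s in samples)
    hits = sum(s.keyword_prob for s in samples if s.accepted_by_other)
    if weight == 0:
        raise ValueError("no (soft) positive samples to estimate from")
    return hits / weight


def relative_recall(collected_by_a: Sequence[Sample],
                    collected_by_b: Sequence[Sample]) -> float:
    """Estimate recall_B / recall_A from the two cross-decoded datasets.

    collected_by_a: samples model A accepted in production, re-decoded offline by B.
    collected_by_b: samples model B accepted in production, re-decoded offline by A.
    """
    p_b_given_a = acceptance_rate(collected_by_a)  # P(B accepts | A accepted, keyword present)
    p_a_given_b = acceptance_rate(collected_by_b)  # P(A accepts | B accepted, keyword present)
    return p_b_given_a / p_a_given_b


if __name__ == "__main__":
    # Toy example with hard labels: B recovers 90% of A's true detections,
    # while A recovers only 75% of B's, suggesting B has higher recall.
    d_a = [Sample(True, 1.0)] * 90 + [Sample(False, 1.0)] * 10
    d_b = [Sample(True, 1.0)] * 75 + [Sample(False, 1.0)] * 25
    print(f"relative recall (B vs. A): {relative_recall(d_a, d_b):.2f}")  # ~1.20
```

With hard labels the weights are simply 0 or 1; swapping in soft keyword probabilities from an automatic labeler is one reading of the Semi-Supervised AB/BA idea, since it avoids routing audio to human reviewers.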
Related papers
- Low-Cost High-Power Membership Inference Attacks [15.240271537329534]
Membership inference attacks aim to detect if a particular data point was used in training a model.
We design a novel statistical test to perform robust membership inference attacks with low computational overhead.
RMIA lays the groundwork for practical yet accurate data privacy risk assessment in machine learning.
arXiv Detail & Related papers (2023-12-06T03:18:49Z)
- Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration [32.15773300068426]
Membership Inference Attacks (MIAs) aim to infer whether a target data record has been utilized for model training or not.
We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA).
Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal.
arXiv Detail & Related papers (2023-11-10T13:55:05Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method, Min-K% Prob, based on a simple hypothesis.
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method not only beats BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit, OpenBackdoor, to foster the implementation and evaluation of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample-efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
Undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining [18.174086416883412]
We introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context.
We show that even in the presence of multiple correct references, n-gram based metrics and embedding based metrics do not perform well at separating relevant responses from even random negatives.
We propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset.
arXiv Detail & Related papers (2020-09-23T18:06:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.