Detecting Unintended Memorization in Language-Model-Fused ASR
- URL: http://arxiv.org/abs/2204.09606v1
- Date: Wed, 20 Apr 2022 16:35:13 GMT
- Title: Detecting Unintended Memorization in Language-Model-Fused ASR
- Authors: W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews
- Abstract summary: We propose a framework for detecting memorization of random textual sequences (which we call canaries) in the LM training data.
On a production-grade Conformer RNN-T E2E model fused with a Transformer LM, we show that detecting memorization of canaries from the LM training data of 300M examples is possible.
Motivated to protect privacy, we also show that such memorization gets significantly reduced by per-example gradient-clipped LM training.
- Score: 10.079200692649462
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: End-to-end (E2E) models are often accompanied by language models (LMs) via shallow fusion to boost their overall quality as well as their recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual sequences (which we call canaries) in the LM training data when one has only black-box (query) access to the LM-fused speech recognizer, as opposed to direct access to the LM. On a production-grade Conformer RNN-T E2E model fused with a Transformer LM, we show that detecting memorization of singly-occurring canaries from the LM training data of 300M examples is possible. Motivated to protect privacy, we also show that such memorization gets significantly reduced by per-example gradient-clipped LM training without compromising overall quality.
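The detection framework described above assumes only query access to the fused recognizer, so one plausible probe is to synthesize audio for canary texts, run it through the LM-fused ASR, and compare recognition quality on canaries that were present in the LM training data against matched held-out canaries. A minimal sketch of that comparison, where `recognize`, `tts`, and the canary lists are placeholders rather than the paper's exact pipeline:

```python
import numpy as np
import jiwer  # pip install jiwer; provides a standard word-error-rate metric

def canary_wer_gap(recognize, tts, trained_canaries, heldout_canaries):
    """Black-box memorization probe against an LM-fused recognizer.

    `tts` maps canary text to audio and `recognize` maps audio to a
    transcript; both stand in for a TTS front end and the production
    fused ASR system, which are not part of this sketch.
    """
    def mean_wer(canaries):
        return float(np.mean([jiwer.wer(c, recognize(tts(c))) for c in canaries]))

    wer_trained = mean_wer(trained_canaries)   # canaries injected into LM training data
    wer_heldout = mean_wer(heldout_canaries)   # matched canaries never seen in training
    # A markedly lower error rate on the trained canaries suggests the fused
    # LM has memorized them.
    return wer_heldout - wer_trained
```

The mitigation mentioned in the abstract, per-example gradient clipping, bounds each training example's influence on an LM update. A generic PyTorch-style sketch (not the authors' training code; `model`, `loss_fn`, and the example iterator are assumed, and every parameter is assumed to receive a gradient):

```python
import torch

def clipped_lm_step(model, optimizer, loss_fn, examples, clip_norm=1.0):
    """One optimizer step in which each example's gradient is clipped to
    `clip_norm` before the clipped gradients are averaged and applied."""
    accum = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in examples:                        # process examples one at a time
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for acc, g in zip(accum, grads):
            acc += g * clip_coef                 # accumulate the clipped gradient
    for p, acc in zip(model.parameters(), accum):
        p.grad = acc / len(examples)             # average of clipped per-example gradients
    optimizer.step()
```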
Related papers
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models [10.783764497590473]
Transformer-based language models (LMs) track contextual information through large, hard-coded input windows.
We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors.
Tested on a task designed to probe an LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history.
arXiv Detail & Related papers (2024-02-23T11:30:39Z)
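The MemoryPrompt entry above conditions a frozen LM on a running memory by prefixing its input with a small number of vectors produced by an auxiliary recurrent network. A rough illustration of that interface in PyTorch with a Hugging Face GPT-2 backbone; the dimensions, the GRU module, and the wiring are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class MemoryPrefix(nn.Module):
    """Tiny recurrent module that turns a summary of earlier context into
    `n_prefix` vectors prepended to the LM's token embeddings."""

    def __init__(self, d_model=768, n_prefix=4):
        super().__init__()
        self.n_prefix = n_prefix
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.to_prefix = nn.Linear(d_model, n_prefix * d_model)

    def forward(self, prev_hidden, memory_state=None):
        # prev_hidden: (batch, seq, d_model) representation of earlier segments
        _, memory_state = self.rnn(prev_hidden, memory_state)
        prefix = self.to_prefix(memory_state[-1])             # (batch, n_prefix * d_model)
        d_model = prefix.size(-1) // self.n_prefix
        return prefix.view(-1, self.n_prefix, d_model), memory_state

lm = GPT2LMHeadModel.from_pretrained("gpt2")                  # stand-in frozen backbone
tok = GPT2Tokenizer.from_pretrained("gpt2")
memory = MemoryPrefix()

ids = tok("The capital moved to", return_tensors="pt").input_ids
tok_emb = lm.transformer.wte(ids)                             # token embeddings
prefix, state = memory(tok_emb)                               # tok_emb stands in for past context
out = lm(inputs_embeds=torch.cat([prefix, tok_emb], dim=1))   # prefix-conditioned forward pass
```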
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Exploring In-Context Learning of Textless Speech Language Model for Speech Classification Tasks [98.5311231450689]
In-context learning (ICL) has played an essential role in utilizing large language models (LLMs).
This study is the first work exploring ICL for speech classification tasks with a textless speech LM.
arXiv Detail & Related papers (2023-10-19T05:31:45Z)
- Preference-grounded Token-level Guidance for Language Model Fine-tuning [105.88789610320426]
Aligning language models with preferences is an important problem in natural language generation.
For LM training, based on the amount of supervised data, we present two *minimalist* learning objectives that utilize the learned guidance.
In experiments, our method performs competitively on two distinct representative LM tasks.
arXiv Detail & Related papers (2023-06-01T07:00:07Z)
- Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning.
We create a pool of candidates from the LLM through few-shot prompting and we employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates to produce an enhanced output.
Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
arXiv Detail & Related papers (2023-05-22T22:07:50Z)
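The LMCor entry above is a generate-then-correct pipeline: sample a pool of candidates from the large model via few-shot prompting, then let a small trained corrector merge them into one output. A schematic sketch with Hugging Face APIs, where gpt2 and flan-t5-small merely stand in for the 62B generator and the 250M trained corrector, and the candidate-separator format is an assumption:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Step 1: draw a pool of candidate outputs from the large model via sampling.
generator = pipeline("text-generation", model="gpt2")          # stand-in for a 62B LLM
prompt = "Q: What is the capital of France?\nA:"
outputs = generator(prompt, num_return_sequences=3, do_sample=True, max_new_tokens=16)
candidates = [o["generated_text"][len(prompt):].strip() for o in outputs]

# Step 2: a compact corrector (trained elsewhere to merge candidates) rewrites them.
corr_name = "google/flan-t5-small"                             # stand-in for the LMCor model
corr_tok = AutoTokenizer.from_pretrained(corr_name)
corrector = AutoModelForSeq2SeqLM.from_pretrained(corr_name)

corr_input = prompt + " " + " ".join(f"[cand{i}] {c}" for i, c in enumerate(candidates))
ids = corr_tok(corr_input, return_tensors="pt").input_ids
merged = corrector.generate(ids, max_new_tokens=16)
print(corr_tok.decode(merged[0], skip_special_tokens=True))
```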
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs into the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)
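The LM-aware MWER entry above trains the recognizer discriminatively against expected word errors over an n-best list, with the external LM folded into the hypothesis scores. A toy sketch of the standard MWER objective combined with shallow-fusion scoring; the fusion weight and the example numbers are illustrative, not taken from the paper:

```python
import torch

def mwer_loss(asr_logp, lm_logp, word_errors, lam=0.3):
    """Minimum word error rate loss over an n-best list.

    asr_logp, lm_logp: (n,) log-scores of each hypothesis from the E2E model
    and the external LM; word_errors: (n,) edit distances to the reference.
    Shallow fusion enters through the combined hypothesis score.
    """
    fused = asr_logp + lam * lm_logp               # log p_ASR + lam * log p_LM
    post = torch.softmax(fused, dim=0)             # renormalize over the n-best list
    errs = word_errors.float()
    return torch.sum(post * (errs - errs.mean()))  # expected (mean-centered) word errors

# Example: three hypotheses with 0, 1, and 2 word errors.
loss = mwer_loss(torch.tensor([-3.0, -3.2, -4.0]),
                 torch.tensor([-10.0, -9.5, -11.0]),
                 torch.tensor([0, 1, 2]))
```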
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
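The last entry measures memorization by prompting a model with a prefix drawn from its training data and checking whether the continuation is reproduced verbatim. A minimal probe along those lines with a Hugging Face causal LM; gpt2 and the example strings are placeholders, not data known to be memorized:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def is_memorized(prefix, continuation):
    """Greedy-decode from `prefix` and report whether the model emits
    `continuation` verbatim, token for token."""
    ids = tok(prefix, return_tensors="pt").input_ids
    target = tok(continuation, add_special_tokens=False).input_ids
    with torch.no_grad():
        out = lm.generate(ids, max_new_tokens=len(target), do_sample=False)
    emitted = out[0, ids.shape[1]:].tolist()
    return emitted == target

# Hypothetical training snippet split into prompt and expected continuation.
print(is_memorized("My email address is", " john.doe@example.com"))
```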
This list is automatically generated from the titles and abstracts of the papers on this site.