Exploratory Evaluation of Speech Content Masking
- URL: http://arxiv.org/abs/2401.03936v1
- Date: Mon, 8 Jan 2024 14:56:03 GMT
- Title: Exploratory Evaluation of Speech Content Masking
- Authors: Jennifer Williams, Karla Pizzi, Paul-Gauthier Noe, Sneha Das
- Abstract summary: We introduce a toy problem that explores an emerging type of privacy called "content masking".
We evaluate a baseline masking technique based on modifying sequences of discrete phone representations (phone codes).
We investigate three different masking locations and three types of masking strategies: noise substitution, word deletion, and phone sequence reversal.
- Score: 7.012446339121189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most recent speech privacy efforts have focused on anonymizing acoustic
speaker attributes but there has not been as much research into protecting
information from speech content. We introduce a toy problem that explores an
emerging type of privacy called "content masking" which conceals selected words
and phrases in speech. In our efforts to define this problem space, we evaluate
an introductory baseline masking technique based on modifying sequences of
discrete phone representations (phone codes) produced from a pre-trained
vector-quantized variational autoencoder (VQ-VAE) and re-synthesized using
WaveRNN. We investigate three different masking locations and three types of
masking strategies: noise substitution, word deletion, and phone sequence
reversal. Our work attempts to characterize how masking affects two downstream
tasks: automatic speech recognition (ASR) and automatic speaker verification
(ASV). We observe how the different mask types and locations impact these
downstream tasks and discuss how these issues may influence privacy goals.
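To make the baseline concrete, here is a minimal Python sketch of how the three masking strategies might operate on a sequence of discrete phone codes. The codebook size, toy code values, span boundaries, and function names are illustrative assumptions; the paper's actual pipeline (VQ-VAE code extraction and WaveRNN re-synthesis) is not reproduced here.

```python
import random

# Illustrative sketch of the three masking strategies applied to a sequence
# of discrete phone codes (VQ-VAE codebook indices). All values here are
# hypothetical; the real pipeline extracts codes with a pre-trained VQ-VAE
# and re-synthesizes audio with WaveRNN, neither of which is shown.

CODEBOOK_SIZE = 256  # assumed codebook size; the abstract does not specify one


def noise_substitution(codes, start, end, rng=None):
    """Replace the codes in [start, end) with random codebook entries."""
    rng = rng or random.Random(0)
    middle = [rng.randrange(CODEBOOK_SIZE) for _ in range(end - start)]
    return codes[:start] + middle + codes[end:]


def word_deletion(codes, start, end):
    """Drop the codes in [start, end) entirely, so the target word vanishes."""
    return codes[:start] + codes[end:]


def phone_sequence_reversal(codes, start, end):
    """Reverse the order of the phone codes in [start, end)."""
    return codes[:start] + codes[start:end][::-1] + codes[end:]


if __name__ == "__main__":
    # Toy phone-code sequence; indices 3..6 stand in for the word to mask.
    # Varying (start, end) corresponds to the paper's different mask locations.
    codes = [12, 7, 44, 91, 91, 30, 5, 18, 63]
    start, end = 3, 7
    print("noise substitution :", noise_substitution(codes, start, end))
    print("word deletion      :", word_deletion(codes, start, end))
    print("phone reversal     :", phone_sequence_reversal(codes, start, end))
```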
Related papers
- Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation [79.13636675697096]
MQA-RefAVS (Mask Quality Assessment in the Ref-AVS context) is a task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations.
We propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information.
arXiv Detail & Related papers (2026-02-03T07:47:59Z) - Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence [22.45673628231233]
Action-based video object segmentation addresses this by linking segmentation with action semantics.
We take the first step by studying action-based video object segmentation under label noise.
We adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them.
arXiv Detail & Related papers (2025-09-20T13:03:43Z) - Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach while maintaining the 2D-based face editing task.
We transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z) - Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router [72.29811385678168]
We introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene.
Specifically, we propose a novel framework incorporating a fine-grained Embedding Router that binds 'who' and 'speak what' together to address the audio-to-character correspondence control.
arXiv Detail & Related papers (2025-06-24T17:50:16Z) - Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits [82.8859060022651]
We introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox.
Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods.
Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization.
arXiv Detail & Related papers (2025-01-07T14:17:47Z) - Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization [0.5497663232622965]
This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE).
It is designed to disentangle these components to specifically target and modify the speaker identity while preserving the linguistic and emotional content.
Findings indicate that this method outperforms most baseline techniques in preserving emotional information.
arXiv Detail & Related papers (2024-09-24T08:55:10Z) - SafeEar: Content Privacy-Preserving Audio Deepfake Detection [17.859275594843965]
We propose SafeEar, a novel framework that aims to detect deepfake audios without relying on accessing the speech content within.
Our key idea is to devise a neural audio codec into a novel decoupling model that well separates the semantic and acoustic information from audio samples.
In this way, no semantic content will be exposed to the detector.
arXiv Detail & Related papers (2024-09-14T02:45:09Z) - Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion [5.483488375189695]
Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style.
Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well-aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information from the audio input.
We present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations.
arXiv Detail & Related papers (2024-09-01T11:51:18Z) - Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z) - Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z) - Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach [29.962519978925236]
We propose two kinds of masking approaches: speech-level masking and phoneme-level masking.
We pre-trained the model via these two approaches, and evaluated on two downstream tasks, phoneme classification and speaker recognition.
arXiv Detail & Related papers (2022-10-25T07:26:47Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z) - Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z) - Adversarially learning disentangled speech representations for robust multi-factor voice conversion [39.91395314356084]
We propose a disentangled speech representation learning framework based on adversarial learning.
Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled.
Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors.
arXiv Detail & Related papers (2021-01-30T08:29:55Z) - Multimodal Speech Recognition with Unstructured Audio Masking [49.01826387664443]
We simulate a more realistic masking scenario during model training, called RandWordMask.
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words.
Our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted.
arXiv Detail & Related papers (2020-10-16T21:49:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.