Improving Speech Representation Learning via Speech-level and
Phoneme-level Masking Approach
- URL: http://arxiv.org/abs/2210.13805v1
- Date: Tue, 25 Oct 2022 07:26:47 GMT
- Title: Improving Speech Representation Learning via Speech-level and
Phoneme-level Masking Approach
- Authors: Xulong Zhang, Jianzong Wang, Ning Cheng, Kexin Zhu, Jing Xiao
- Abstract summary: We propose two kinds of masking approaches: speech-level masking and phoneme-level masking.
We pre-trained the model via these two approaches and evaluated it on two downstream tasks: phoneme classification and speaker recognition.
- Score: 29.962519978925236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recovering masked speech frames is widely used in speech
representation learning. However, most of these models use random masking
during pre-training. In this work, we propose two kinds of masking approaches:
(1) speech-level masking, which makes the model mask more speech segments than
silence segments, and (2) phoneme-level masking, which forces the model to mask
all frames of a phoneme rather than partial phoneme pieces. We pre-trained the
model with these two approaches and evaluated it on two downstream tasks,
phoneme classification and speaker recognition. The experiments demonstrate
that the proposed masking approaches improve the quality of the learned speech
representations.
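The two masking strategies above can be combined into a single frame-selection step when frame-level phoneme alignments are available. The sketch below is an illustrative reconstruction, not the authors' implementation: the function name, the silence label convention, and the mask budget are all assumptions.

```python
import random

def phoneme_level_mask(phoneme_ids, mask_ratio=0.15, silence_id=0, seed=0):
    """Select frame indices to mask so that whole phonemes are masked.

    phoneme_ids: per-frame phoneme labels (hypothetical alignment).
    Silence frames (silence_id) are excluded from the candidate pool,
    approximating the speech-level bias toward masking speech rather
    than silence; whole segments are masked at once, approximating
    phoneme-level masking.
    """
    rng = random.Random(seed)
    # Group consecutive frames with the same phoneme label into segments.
    segments, start = [], 0
    for i in range(1, len(phoneme_ids) + 1):
        if i == len(phoneme_ids) or phoneme_ids[i] != phoneme_ids[i - 1]:
            segments.append((start, i, phoneme_ids[start]))
            start = i
    # Candidate segments exclude silence (speech-level masking).
    speech_segs = [s for s in segments if s[2] != silence_id]
    rng.shuffle(speech_segs)
    budget = int(mask_ratio * len(phoneme_ids))
    masked = []
    for s, e, _ in speech_segs:
        if len(masked) >= budget:
            break
        masked.extend(range(s, e))  # mask all frames of the phoneme
    return sorted(masked)
```

Masking proceeds segment by segment until the budget is reached, so a phoneme is never split between masked and unmasked frames.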
Related papers
- MaskSR: Masked Language Model for Full-band Speech Restoration [7.015213589171985]
Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions.
We propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth.
arXiv Detail & Related papers (2024-06-04T08:23:57Z)
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- DFormer: Diffusion-guided Transformer for Universal Image Segmentation [86.73405604947459]
The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model.
At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly-generated masks.
Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val 2017 set.
arXiv Detail & Related papers (2023-06-06T06:33:32Z)
- InforMask: Unsupervised Informative Masking for Language Model Pretraining [13.177839395411858]
We propose a new unsupervised masking strategy for training masked language models.
InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask.
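PMI-guided token selection can be illustrated with corpus co-occurrence counts: score each token by its PMI with the tokens it appears alongside, and mask the highest-scoring ones. This is a simplified sketch of the idea, not the exact InforMask algorithm; the function name and the sentence-level co-occurrence window are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def informative_mask(sentence, corpus, top_k=2):
    """Rank tokens by summed PMI with co-occurring tokens and return the
    top_k most informative ones (simplified PMI-guided masking)."""
    unigrams = Counter(t for sent in corpus for t in sent)
    pairs = Counter()
    for sent in corpus:
        for a, b in combinations(sorted(set(sent)), 2):
            pairs[(a, b)] += 1
    total = sum(unigrams.values())
    total_pairs = max(sum(pairs.values()), 1)

    def pmi(a, b):
        # PMI(a, b) = log p(a, b) / (p(a) * p(b)); 0 if never co-occurring.
        key = tuple(sorted((a, b)))
        if pairs[key] == 0:
            return 0.0
        p_ab = pairs[key] / total_pairs
        return math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

    scores = {t: sum(pmi(t, o) for o in sentence if o != t)
              for t in set(sentence)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Frequent function words co-occur with everything and so receive low PMI scores, leaving content-bearing tokens as the preferred masking targets.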
arXiv Detail & Related papers (2022-10-21T07:10:56Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Per-Pixel Classification is Not All You Need for Semantic Segmentation [184.2905747595058]
Mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks.
We propose MaskFormer, a simple mask classification model which predicts a set of binary masks.
Our method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
arXiv Detail & Related papers (2021-07-13T17:59:50Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.