Phoneme Segmentation Using Self-Supervised Speech Models
- URL: http://arxiv.org/abs/2211.01461v1
- Date: Wed, 2 Nov 2022 19:57:31 GMT
- Title: Phoneme Segmentation Using Self-Supervised Speech Models
- Authors: Luke Strgar and David Harwath
- Abstract summary: We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task.
Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training.
- Score: 13.956691231452336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We apply transfer learning to the task of phoneme segmentation and
demonstrate the utility of representations learned in self-supervised
pre-training for the task. Our model extends transformer-style encoders with
strategically placed convolutions that manipulate features learned in
pre-training. Using the TIMIT and Buckeye corpora we train and test the model
in the supervised and unsupervised settings. The latter case is accomplished by
constructing a noisy label set from the predictions of a separate model that was
itself trained in an unsupervised fashion. Results indicate our model
eclipses previous state-of-the-art performance in both settings and on both
datasets. Finally, following observations during published code review and
attempts to reproduce past segmentation results, we find a need to disambiguate
the definition and implementation of widely-used evaluation metrics. We resolve
this ambiguity by delineating two distinct evaluation schemes and describing
their nuances.
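For a concrete picture of the recipe sketched in the abstract, the snippet below is a minimal illustration (not the authors' released code): frame-level features from a pre-trained self-supervised encoder are passed through a small convolutional head that scores each frame as a phoneme boundary. The checkpoint name, layer sizes, and loss setup are illustrative assumptions.
```python
# Minimal sketch, assuming a HuggingFace wav2vec 2.0 checkpoint; the conv head
# and hyperparameters below are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class BoundaryHead(nn.Module):
    """1-D conv stack over pre-trained features -> per-frame boundary logits."""
    def __init__(self, feat_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) -> logits: (batch, frames)
        return self.net(feats.transpose(1, 2)).squeeze(1)

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # ~20 ms frames
head = BoundaryHead()

waveform = torch.randn(1, 16000)                   # 1 s of 16 kHz audio (placeholder)
with torch.no_grad():
    feats = encoder(waveform).last_hidden_state    # (1, T, 768)
logits = head(feats)                               # (1, T) per-frame boundary scores

# Supervised training would use binary cross-entropy against frame-aligned
# boundary labels; the unsupervised setting in the abstract replaces those
# labels with (noisy) predictions from a separately trained unsupervised model.
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.zeros_like(logits))              # dummy targets for illustration
```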
Related papers
- With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z) - Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z) - Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium
Segmentation [0.0]
We present a novel semi-supervised segmentation model based on parameter decoupling strategy to encourage consistent predictions from diverse views.
Our method achieves results competitive with state-of-the-art semi-supervised methods on the Atrial Challenge dataset.
arXiv Detail & Related papers (2021-09-20T14:51:42Z) - Revisiting Contrastive Methods for Unsupervised Learning of Visual
Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z) - D-LEMA: Deep Learning Ensembles from Multiple Annotations -- Application
to Skin Lesion Segmentation [14.266037264648533]
Leveraging a collection of annotators' opinions for an image is an interesting way of estimating a gold standard.
We propose an approach to handle annotators' disagreements when training a deep model.
arXiv Detail & Related papers (2020-12-14T01:51:22Z) - UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability
Prediction with Multi-task Learning on Self-Supervised Annotations [0.0]
This work describes a self-supervised data augmentation approach used to improve learning models' performances when only a moderate amount of labeled data is available.
Neural language models are fine-tuned using this procedure in the context of the AcCompl-it shared task at EVALITA 2020.
arXiv Detail & Related papers (2020-11-10T15:50:37Z) - Self-Supervised Contrastive Learning for Unsupervised Phoneme
Segmentation [37.054709598792165]
The model is a convolutional neural network that operates directly on the raw waveform.
It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle.
At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries (see the peak-picking sketch after this list).
arXiv Detail & Related papers (2020-07-27T12:10:21Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical
Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z) - Learning What Makes a Difference from Counterfactual Examples and
Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z) - Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
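As referenced in the "Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation" entry above, a common test-time step is to convert per-frame scores into boundary times by peak picking. The sketch below shows one typical way to do this; the frame rate and prominence threshold are assumptions, not values from the cited paper.
```python
# Minimal peak-picking sketch (assumed details, not the cited paper's code):
# per-frame boundary scores are turned into timestamps by selecting local maxima.
import numpy as np
from scipy.signal import find_peaks

def scores_to_boundaries(scores: np.ndarray,
                         frame_rate_hz: float = 100.0,
                         prominence: float = 0.1) -> np.ndarray:
    """Map per-frame boundary scores to boundary timestamps (seconds)."""
    peaks, _ = find_peaks(scores, prominence=prominence)
    return peaks / frame_rate_hz

# Example: a toy score track with two clear peaks -> two boundaries.
toy_scores = np.array([0.0, 0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.0])
print(scores_to_boundaries(toy_scores, frame_rate_hz=100.0))  # -> [0.02 0.05]
```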
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.