Related papers: Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning

URL: http://arxiv.org/abs/2506.04527v1
Date: Thu, 05 Jun 2025 00:24:00 GMT
Title: Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning
Authors: Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto,
Abstract summary: We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels.
Score: 9.413818055887763
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions label generation on the corresponding graphemes in two ways: 1) adding implicit grapheme conditioning through a prompt encoder that uses pre-trained BERT features, and 2) explicitly pruning label hypotheses that are inconsistent with the graphemes during inference. These methods make it possible to obtain parallel data of speech, labels, and graphemes, which can be used for various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on an accent estimation task confirmed that the parallel data created by the proposed method effectively improves estimation accuracy.
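The explicit pruning idea in point 2) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a hypothetical per-grapheme reading dictionary (EXAMPLE_READINGS), a hypothetical set of prosodic symbols (PROSODY_MARKS), and beam entries given as (label string, score) pairs. Under these assumptions, a hypothesis survives only if, after its prosodic symbols are stripped, it is a prefix of some concatenation of allowed readings of the graphemes.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: each grapheme maps to the katakana readings it may take.
EXAMPLE_READINGS: Dict[str, Set[str]] = {
    "日": {"ニ", "ニチ", "ニッ", "ヒ", "ビ"},
    "本": {"ホン", "ポン", "ボン", "モト"},
}

# Hypothetical prosodic symbols interleaved with phonemes; ignored by the check.
PROSODY_MARKS = frozenset("#[]^_?")


def strip_prosody(label: str) -> str:
    """Remove prosodic symbols so that only phonemic symbols are compared."""
    return "".join(ch for ch in label if ch not in PROSODY_MARKS)


def is_consistent_prefix(graphemes: str, hyp: str, readings: Dict[str, Set[str]]) -> bool:
    """Return True if `hyp` is a prefix of some reading of `graphemes`.

    A search state (i, j) means graphemes[:i] has been read as exactly hyp[:j].
    Graphemes missing from `readings` (e.g. kana) are assumed to read as themselves.
    """
    seen = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        i, j = stack.pop()
        if j == len(hyp):
            return True  # hypothesis fully explained; later graphemes are not read yet
        if i == len(graphemes):
            continue  # hypothesis is longer than any reading of the graphemes
        for reading in readings.get(graphemes[i], {graphemes[i]}):
            # Compare only the overlap: the hypothesis may end inside this reading.
            k = min(len(reading), len(hyp) - j)
            if hyp[j:j + k] != reading[:k]:
                continue
            if j + k == len(hyp):
                return True
            state = (i + 1, j + len(reading))
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return False


def prune_hypotheses(
    graphemes: str,
    beam: List[Tuple[str, float]],
    readings: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Drop (label, score) beam entries whose phonemes contradict the graphemes."""
    return [
        (label, score)
        for label, score in beam
        if is_consistent_prefix(graphemes, strip_prosody(label), readings)
    ]


beam = [("ニ[ホン", -1.0), ("ヒノモト", -0.5), ("ニッポン", -2.0)]
print(prune_hypotheses("日本", beam, EXAMPLE_READINGS))
# → [('ニ[ホン', -1.0), ('ニッポン', -2.0)]
```

Checking prefixes rather than complete label sequences lets such a filter run at every beam-search step. A per-character dictionary like this toy one cannot express readings that span several graphemes (for example 今日 read as キョウ), so a practical system would need a richer lexicon.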

Related papers

Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation [4.314729314139958]
We propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair. We employ a decoding strategy that utilizes dictionary prior knowledge to correct errors in phonemic labeling. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.
arXiv Detail & Related papers (2025-06-09T11:10:24Z)
Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs [35.44563283531432]
Text-attributed graphs (TAGs) are an important kind of real-world graph-structured data in which each node is associated with raw text. For TAGs, traditional few-shot node classification methods train directly on pre-processed node features. We propose P2TAG, a framework designed for few-shot node classification on TAGs with graph pre-training and prompting.
arXiv Detail & Related papers (2024-07-22T07:24:21Z)
Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that learning semantic correspondences is inherently data-hungry. We demonstrate that a simple machine annotator reliably enriches paired key points via machine supervision. Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks such as SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
Label Matching Semi-Supervised Object Detection [85.99282969977541]
Semi-supervised object detection has made significant progress with the development of mean-teacher-driven self-training. The label mismatch problem has not yet been fully explored in previous works, and it leads to severe confirmation bias during self-training. We propose a simple yet effective LabelMatch framework that addresses it from two different yet complementary perspectives.
arXiv Detail & Related papers (2022-06-14T05:59:41Z)
Enhancing Continual Learning with Global Prototypes: Counteracting Negative Representation Drift [16.177180198865848]
Continual learning aims to learn a sequence of tasks over time, with data distributions shifting from one task to another. Some negative representation drift can result in catastrophic forgetting by causing the locally learned class prototypes and data representations to correlate poorly across tasks. We propose a method that finds global prototypes to guide the learning and learns data representations regularized by self-supervised information.
arXiv Detail & Related papers (2022-05-24T16:41:30Z)
Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem. We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, which is based on graph neural network (GNN) for short text classification. First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs. Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z)
Joint Graph Learning and Matching for Semantic Feature Correspondence [69.71998282148762]
We propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching. The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object and SPair-71k). It outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks.
arXiv Detail & Related papers (2021-09-01T08:24:02Z)
Graph-based Label Propagation for Semi-Supervised Speaker Identification [10.87690067963342]
We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario. We show that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods.
arXiv Detail & Related papers (2021-06-15T15:10:33Z)
Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory [21.06607915149245]
We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property. We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically-enhanced training sets result in more accurate models.
arXiv Detail & Related papers (2020-11-03T17:18:03Z)
Handling Missing Data with Graph Representation Learning [62.59831675688714]
We propose GRAPE, a graph-based framework for feature imputation as well as label prediction. Under GRAPE, the feature imputation is formulated as an edge-level prediction task and the label prediction as a node-level prediction task. Experimental results on nine benchmark datasets show that GRAPE yields 20% lower mean absolute error for imputation tasks and 10% lower for label prediction tasks.
arXiv Detail & Related papers (2020-10-30T17:59:13Z)
Line Graph Neural Networks for Link Prediction [71.00689542259052]
We consider the graph link prediction task, which is a classic graph analytical problem with many real-world applications. In this formalism, a link prediction problem is converted to a graph classification task. We propose a radically different path that makes use of line graphs from graph theory. In particular, each node in a line graph corresponds to a unique edge in the original graph. Therefore, link prediction problems in the original graph can be equivalently solved as a node classification problem in the corresponding line graph, instead of as a graph classification task.
arXiv Detail & Related papers (2020-10-20T05:54:31Z)
Inducing Alignment Structure with Gated Graph Attention Networks for Sentence Matching [24.02847802702168]
This paper proposes a graph-based approach for sentence matching. We represent a sentence pair as a graph using several carefully designed strategies. We then employ a novel gated graph attention network to encode the constructed graph for sentence matching.
arXiv Detail & Related papers (2020-10-15T11:25:54Z)
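The line-graph reformulation summarized in the "Line Graph Neural Networks for Link Prediction" entry can be illustrated with a minimal sketch. It only builds the line graph and locates the node that stands for a candidate link; the function names are hypothetical, it assumes a simple undirected graph given as an edge list, and it omits any subgraph extraction and the node classifier an actual system would use.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple

LGNode = FrozenSet[Hashable]  # a line-graph node = an undirected edge {u, v} of G


def line_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[LGNode, Set[LGNode]]:
    """Build the line graph L(G) of a simple undirected graph G.

    Each edge of G becomes a node of L(G); two nodes of L(G) are adjacent
    exactly when the corresponding edges of G share an endpoint.
    """
    nodes = {frozenset(e) for e in edges}
    adjacency: Dict[LGNode, Set[LGNode]] = {n: set() for n in nodes}
    incident = defaultdict(list)  # endpoint of G -> edges of G touching it
    for edge in nodes:
        for endpoint in edge:
            incident[endpoint].append(edge)
    # Every pair of edges meeting at a shared endpoint is linked in L(G).
    for touching in incident.values():
        for a, b in combinations(touching, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def link_as_node(
    edges: Iterable[Tuple[Hashable, Hashable]],
    target: Tuple[Hashable, Hashable],
) -> Tuple[Dict[LGNode, Set[LGNode]], LGNode]:
    """Insert the candidate link `target` into G and return (L(G), its line-graph node).

    Predicting whether `target` exists then becomes classifying that single node.
    """
    return line_graph(list(edges) + [target]), frozenset(target)


adj, node = link_as_node([(1, 2), (2, 3), (3, 4)], (1, 3))
print(sorted(tuple(sorted(e)) for e in adj[node]))
# → [(1, 2), (2, 3), (3, 4)]
```

Here the candidate link (1, 3) becomes a node whose neighbours are all edges touching node 1 or node 3, so a node classifier sees exactly the local structure around both endpoints.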

This list is automatically generated from the titles and abstracts of the papers on this site.