IPA-CLIP: Integrating Phonetic Priors into Vision and Language
Pretraining
- URL: http://arxiv.org/abs/2303.03144v1
- Date: Mon, 6 Mar 2023 13:59:37 GMT
- Title: IPA-CLIP: Integrating Phonetic Priors into Vision and Language
Pretraining
- Authors: Chihaya Matsuhira, Marc A. Kastner, Takahiro Komamizu, Takatsugu
Hirayama, Keisuke Doman, Yasutomo Kawanishi, Ichiro Ide
- Abstract summary: This paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP)
IPA-CLIP comprises this pronunciation encoder and the original CLIP encoders (image and text)
- Score: 8.129944388402839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale Vision and Language (V&L) pretraining has become the
standard backbone of many multimedia systems. While it shows remarkable
performance even in unseen situations, it often behaves in ways that are not
intuitive to humans. In particular, such models usually do not consider the
pronunciation of the input, which humans utilize to understand language,
especially when it comes to unknown words. Thus, this paper inserts a phonetic
prior into Contrastive Language-Image Pretraining (CLIP), one of the V&L
pretrained models, to make it consider the pronunciation similarity among its
pronunciation inputs. To achieve this, we first propose a phoneme embedding
that utilizes the phoneme relationships provided by the International Phonetic
Alphabet (IPA) chart as a phonetic prior. Next, by distilling the frozen CLIP
text encoder, we train a pronunciation encoder employing the IPA-based
embedding. The proposed model named IPA-CLIP comprises this pronunciation
encoder and the original CLIP encoders (image and text). Quantitative
evaluation reveals that the phoneme distribution on the embedding space
represents phonetic relationships more accurately when using the proposed
phoneme embedding. Furthermore, in some multimodal retrieval tasks, we confirm
that the proposed pronunciation encoder enhances the performance of the text
encoder and that the pronunciation encoder handles nonsense words in a more
phonetic manner than the text encoder. Finally, qualitative evaluation verifies
the correlation between the pronunciation encoder and human perception
regarding pronunciation similarity.
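As a rough illustration of the approach described in the abstract, the sketch below shows (1) a phoneme embedding whose initialization reflects IPA-chart-style articulatory features, and (2) a distillation step that pulls a pronunciation encoder toward the frozen CLIP text encoder. It assumes PyTorch; the feature values, model sizes, and the simple MSE distillation loss are illustrative placeholders, not the exact formulation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical articulatory features per phoneme (e.g., place, manner, voicing).
# The paper derives phoneme relationships from the IPA chart; the numbers below
# are placeholders for illustration only.
IPA_FEATURES = {
    "p": [0.0, 0.0, 0.0],  # bilabial, plosive, voiceless
    "b": [0.0, 0.0, 1.0],  # bilabial, plosive, voiced
    "m": [0.0, 1.0, 1.0],  # bilabial, nasal, voiced
    "t": [0.5, 0.0, 0.0],  # alveolar, plosive, voiceless
    "d": [0.5, 0.0, 1.0],  # alveolar, plosive, voiced
}
PHONEMES = list(IPA_FEATURES)  # index -> phoneme symbol


class PhonemeEmbedding(nn.Module):
    """Maps phoneme IDs to vectors so that IPA-similar phonemes start out close."""

    def __init__(self, dim: int):
        super().__init__()
        feats = torch.tensor([IPA_FEATURES[p] for p in PHONEMES])  # (V, F)
        self.register_buffer("features", feats)
        self.proj = nn.Linear(feats.shape[1], dim)  # learned map: features -> embedding

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.features[phoneme_ids])  # (B, T, dim)


class PronunciationEncoder(nn.Module):
    """Small Transformer over phoneme embeddings, distilled from the CLIP text encoder."""

    def __init__(self, dim: int = 512, layers: int = 4):
        super().__init__()
        self.embed = PhonemeEmbedding(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(phoneme_ids))  # (B, T, dim)
        return F.normalize(h.mean(dim=1), dim=-1)  # pooled pronunciation embedding


def distillation_step(pron_enc, optimizer, phoneme_ids, frozen_text_emb):
    """One distillation step: pull the pronunciation embedding of a caption toward
    the frozen CLIP text embedding of the same caption (simple MSE stand-in)."""
    pred = pron_enc(phoneme_ids)
    target = F.normalize(frozen_text_emb, dim=-1)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a batch of two "captions" of three phonemes each, with random
# stand-ins for the frozen CLIP text embeddings (dim 512).
model = PronunciationEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
ids = torch.tensor([[0, 2, 3], [1, 4, 0]])  # indices into PHONEMES
clip_text_emb = torch.randn(2, 512)         # placeholder for frozen CLIP output
print(distillation_step(model, opt, ids, clip_text_emb))
```

In the full model, this pronunciation branch sits alongside the original (frozen) CLIP image and text encoders; the sketch only covers the distillation of the pronunciation encoder.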
Related papers
- Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words [10.2138250640885]
We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords in text prompts.
We adopt a decoder-only architecture and use our in-house LLM, PLaMo-100B, pre-trained from scratch on datasets dominated by Japanese and English text, as the decoder.
arXiv Detail & Related papers (2024-08-15T08:50:58Z)
- Phoneme-aware Encoding for Prefix-tree-based Contextual ASR [45.161909551392085]
Tree-constrained Pointer Generator (TCPGen) has shown promise for contextual ASR.
We propose extending it with phoneme-aware encoding to better recognize words of unusual pronunciations.
arXiv Detail & Related papers (2023-12-15T07:37:09Z)
- DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction [1.8322859214908722]
We present a highly precise, PDA-compatible pronunciation learning framework for the task of TTS mispronunciation detection and correction.
We also propose a novel mispronunciation detection model called DTW-SiameseNet, which employs metric learning with a Siamese architecture for Dynamic Time Warping (DTW) with triplet loss.
Human evaluation shows our proposed approach improves pronunciation accuracy on average by 6% compared to strong phoneme-based and audio-based baselines.
arXiv Detail & Related papers (2023-03-01T01:53:11Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech [104.65639892109381]
We propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability.
Experiment results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with 0.30 CMOS gain compared with the FastSpeech 2 baseline.
arXiv Detail & Related papers (2022-03-31T17:12:26Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models [42.761409598613845]
We do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model.
Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive to grapheme-based encoder-decoder-attention modeling.
arXiv Detail & Related papers (2020-05-19T09:54:17Z)