Fugu-MT Paper Translation (Abstract): VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Paper overview: VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

arxiv url: http://arxiv.org/abs/2408.05758v1
Date: Sun, 11 Aug 2024 12:24:23 GMT
Status: translation complete
System update date: 2024-08-13 15:37:52.239769
Title: VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Title (reference translation): VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao
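Abstract summary: For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired. This paper proposes Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint space.
Reference score (independently computed attention metric): 81.32613443072441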
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25Hz from 24kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
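Abstract (reference translation): Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, one that emphasizes the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. This paper proposes a method called Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space and learns to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, providing a promising solution for fine-grained generation and recognition tasks in speech processing. VQ-CTAP can be applied directly to VC and ASR tasks without fine-tuning or additional structures. The paper proposes a sequence-aware semantic connector that connects multiple frozen pre-trained modules for the TTS task and exhibits plug-and-play capability. A stepping optimization strategy is designed to ensure effective model convergence by gradually injecting and adjusting the influence of the various loss components. In addition, a semantic-transfer-wise paralinguistic consistency loss is proposed to enhance representational capability, allowing the model to generalize better to unseen data and capture the nuances of paralinguistic information. VQ-CTAP also achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/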
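
The 960-fold figure in the abstract follows directly from the two rates it quotes (24 kHz input, 25 Hz token rate). The short sketch below only checks that arithmetic; the function names and the 10-second clip are illustrative assumptions and are not taken from the paper or its code.

```python
# Figures quoted in the abstract: 24 kHz input waveform, 25 Hz token rate.
SAMPLE_RATE_HZ = 24_000
TOKEN_RATE_HZ = 25


def samples_per_token(sample_rate_hz: int, token_rate_hz: int) -> int:
    """Number of waveform samples summarized by one token frame.

    Hypothetical helper for illustration; it assumes the token rate
    divides the sample rate evenly.
    """
    if sample_rate_hz % token_rate_hz != 0:
        raise ValueError("token rate must divide the sample rate evenly")
    return sample_rate_hz // token_rate_hz


def tokens_for_duration(seconds: float, token_rate_hz: int) -> int:
    """Token frames for a clip of the given length.

    Hypothetical helper; a trailing partial frame is truncated.
    """
    return int(seconds * token_rate_hz)


print(samples_per_token(SAMPLE_RATE_HZ, TOKEN_RATE_HZ))  # 960
print(tokens_for_duration(10.0, TOKEN_RATE_HZ))          # 250
```

Under these assumptions, a 10-second clip of 240,000 samples would be represented by 250 token frames.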


Related papers