Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource
Parallel Data
- URL: http://arxiv.org/abs/2204.04645v1
- Date: Sun, 10 Apr 2022 10:25:37 GMT
- Title: Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource
Parallel Data
- Authors: Yu Kang, Tianqiao Liu, Hang Li, Yang Hao, Wenbiao Ding
- Abstract summary: Multimodal pre-training for audio-and-text has been proven to be effective and has significantly improved the performance of many downstream speech understanding tasks.
However, these state-of-the-art pre-trained audio-text models work well only when provided with a large amount of parallel audio-and-text data.
In this paper, we investigate whether it is possible to pre-train an audio-text model with low-resource parallel data.
- Score: 15.658471125219224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal pre-training for audio-and-text has recently been proven
effective and has significantly improved the performance of many downstream
speech understanding tasks. However, these state-of-the-art pre-trained
audio-text models work well only when provided with a large amount of parallel
audio-and-text data, which poses challenges for many languages that are rich in
unimodal corpora but lack parallel cross-modal corpora. In this paper, we
investigate whether it is possible to pre-train an audio-text multimodal model
with extremely low-resource parallel data and extra non-parallel unimodal data.
Our pre-training framework consists of the following components: (1)
Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input
text (audio) representations from a noisy version of the input. (2) Cross-modal
Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input
text (audio), given both a noisy version of the input text (audio) and the
corresponding translated noisy audio features (text embeddings). (3) Iterative
Denoising Process (IDP), which iteratively translates raw audio (text) and the
corresponding text embeddings (audio features) translated from the previous
iteration into new, less-noisy text embeddings (audio features). We adapt a
dual cross-modal Transformer as our backbone model, which consists of two
unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our
method achieves performance comparable to that of a model pre-trained on fully
parallel data across multiple downstream speech understanding tasks,
demonstrating the great potential of the proposed method. Our code is available
at: \url{https://github.com/KarlYuKang/Low-Resource-Multimodal-Pre-training}.
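Below is a minimal PyTorch sketch of how the three pre-training components and the dual cross-modal Transformer backbone described above could fit together. The module structure, feature dimensions, noise function, and reconstruction losses are illustrative assumptions made from the abstract alone, not the authors' implementation; the linked repository is the authoritative reference.

    # Hypothetical sketch of the pre-training framework (IDAE, CDAE, IDP) as
    # described in the abstract. All names, shapes, and losses are assumptions.
    import torch
    import torch.nn as nn

    def add_noise(feats, drop_prob=0.15):
        # Randomly zero out time steps to produce the "noisy version" used by IDAE/CDAE.
        keep = (torch.rand(feats.shape[:2], device=feats.device) > drop_prob).float()
        return feats * keep.unsqueeze(-1)

    class UnimodalEncoder(nn.Module):
        # One of the two unimodal Transformer encoders (text or audio) used for IDAE.
        def __init__(self, dim=256, layers=4, heads=4):
            super().__init__()
            block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, num_layers=layers)
            self.reconstruct = nn.Linear(dim, dim)

        def forward(self, noisy_feats):
            return self.reconstruct(self.encoder(noisy_feats))

    class CrossModalEncoder(nn.Module):
        # One of the two cross-modal encoders: attends from one modality to the
        # other and is shared by CDAE and IDP.
        def __init__(self, dim=256, layers=2, heads=4):
            super().__init__()
            block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(block, num_layers=layers)
            self.reconstruct = nn.Linear(dim, dim)

        def forward(self, noisy_target, source):
            # Reconstruct the target modality from its noisy version plus the
            # translated features coming from the other modality.
            return self.reconstruct(self.decoder(noisy_target, source))

    def idae_loss(encoder, feats):
        # Intra-modal denoising: reconstruct clean features from a noisy copy.
        return nn.functional.mse_loss(encoder(add_noise(feats)), feats)

    def cdae_loss(cross_encoder, target_feats, translated_source_feats):
        # Cross-modal denoising on the small parallel set: both inputs are noised.
        pred = cross_encoder(add_noise(target_feats), add_noise(translated_source_feats))
        return nn.functional.mse_loss(pred, target_feats)

    def iterative_denoising(cross_encoder, raw_source_feats, translated_feats, steps=3):
        # IDP: each pass refines the translation produced by the previous pass.
        for _ in range(steps):
            translated_feats = cross_encoder(translated_feats, raw_source_feats)
        return translated_feats

    # Toy usage with random tensors standing in for real audio features / text embeddings.
    audio = torch.randn(2, 50, 256)                 # (batch, frames, dim)
    text = torch.randn(2, 20, 256)                  # (batch, tokens, dim)
    text_enc, audio_enc = UnimodalEncoder(), UnimodalEncoder()
    audio_to_text = CrossModalEncoder()             # produces text embeddings from audio
    loss = (idae_loss(text_enc, text) + idae_loss(audio_enc, audio)
            + cdae_loss(audio_to_text, text, audio))  # raw audio stands in for paired features
    refined_text = iterative_denoising(audio_to_text, audio, torch.randn_like(text))

In a sketch like this, IDAE would run on the large unimodal corpora, while CDAE and IDP would be used with the extremely small parallel set, mirroring the roles described in the abstract.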
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation [43.35578187209748]
Foley audio faces significant challenges in the AI-generated content (AIGC) landscape.
Current text-to-audio technology relies on detailed and acoustically relevant textual descriptions.
We introduce the Multi-modal Image and Narrative Text Dubbing dataset (MINT).
MINT is designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing.
arXiv Detail & Related papers (2024-06-15T10:47:36Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)