UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
- URL: http://arxiv.org/abs/2307.15898v2
- Date: Sat, 9 Sep 2023 11:14:22 GMT
- Title: UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
- Authors: Sen Fang, Bowen Gao, Yangjian Wu, Teik Toe Teoh
- Abstract summary: We propose a novel universal language representation learning method called UniBriVL.
Universal BriVL embeds audio, image, and text into a shared space, enabling the realization of various multimodal applications.
- Score: 0.7510165488300369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large models are recognized for their advantages across a wide range of downstream tasks, and their development is crucial to achieving general artificial intelligence. In this paper,
we propose a novel universal language representation learning method called
UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal
BriVL embeds audio, image, and text into a shared space, enabling the
realization of various multimodal applications. Our approach addresses major
challenges in robust language (both text and audio) representation learning and
effectively captures the correlation between audio and image. Additionally, we
present a qualitative evaluation of the images generated by UniBriVL, which
highlights the potential of our approach for creating images from audio.
Overall, our experimental results demonstrate the efficacy of UniBriVL in
downstream tasks and its ability to choose appropriate images from audio.
The proposed approach has the potential for various applications such as speech
recognition, music signal processing, and captioning systems.
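The paper's code is not reproduced here; the snippet below is only a minimal sketch of the kind of shared audio-image-text embedding space the abstract describes, using a CLIP-style contrastive (InfoNCE) objective. The encoder shapes, projection heads, and loss weighting are illustrative assumptions, not the UniBriVL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceModel(nn.Module):
    """Toy three-tower model: audio, image, and text features are projected
    into one shared embedding space (hypothetical stand-in for UniBriVL)."""

    def __init__(self, audio_dim=128, image_dim=512, text_dim=768, embed_dim=256):
        super().__init__()
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.GELU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.image_proj = nn.Sequential(nn.Linear(image_dim, embed_dim), nn.GELU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, embed_dim), nn.GELU(),
                                       nn.Linear(embed_dim, embed_dim))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style

    def forward(self, audio_feat, image_feat, text_feat):
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        v = F.normalize(self.image_proj(image_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return a, v, t

def info_nce(x, y, logit_scale):
    """Symmetric InfoNCE over a batch of paired embeddings."""
    logits = logit_scale.exp() * x @ y.t()
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy training step on random "pre-extracted" features.
model = SharedSpaceModel()
audio = torch.randn(8, 128)
image = torch.randn(8, 512)
text = torch.randn(8, 768)
a, v, t = model(audio, image, text)
# Align audio with images and audio with text, so audio can later drive
# image retrieval/generation through the shared space.
loss = info_nce(a, v, model.logit_scale) + info_nce(a, t, model.logit_scale)
loss.backward()
print(f"toy contrastive loss: {loss.item():.3f}")
```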
Related papers
- Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation [5.528860524494717]
This paper presents an innovative approach called BGTAI to simplify multimodal understanding by utilizing gloss-based annotation.
By representing text and audio as gloss notations that omit complex semantic nuances, a better alignment with images can potentially be achieved.
arXiv Detail & Related papers (2024-10-04T04:59:50Z)
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-Experts for audioVisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space via a lightweight projection (a toy sketch of such a projection follows this entry).
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
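The EVA entry above mentions a lightweight projection of visual tokens into the speech space and a mixture-of-experts layer. The following is a toy sketch of those two components under assumed dimensions and a simple softmax router; it is not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightVisualProjection(nn.Module):
    """Map a sequence of visual tokens into the speech feature space."""
    def __init__(self, visual_dim=768, speech_dim=512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, speech_dim)

    def forward(self, visual_tokens):          # (B, Nv, visual_dim)
        return self.proj(visual_tokens)        # (B, Nv, speech_dim)

class TinyMoE(nn.Module):
    """Token-wise mixture-of-experts: a router computes a soft mix of expert MLPs."""
    def __init__(self, dim=512, num_experts=4, hidden=1024):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                      # (B, N, dim)
        gates = F.softmax(self.router(x), dim=-1)              # (B, N, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (expert_out * gates.unsqueeze(2)).sum(-1)       # weighted sum over experts

# Toy forward pass: concatenate projected visual tokens with speech tokens,
# then let the MoE layer process the joint sequence.
speech_tokens = torch.randn(2, 100, 512)     # e.g., acoustic encoder output
visual_tokens = torch.randn(2, 16, 768)      # e.g., frame/patch features
proj = LightweightVisualProjection()
moe = TinyMoE()
joint = torch.cat([proj(visual_tokens), speech_tokens], dim=1)
out = moe(joint)
print(out.shape)  # torch.Size([2, 116, 512])
```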
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution (a rough sketch of this idea follows this entry).
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
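The Chain-of-Spot entry above describes attending to a key region at the original image resolution. Below is a rough illustrative sketch (not the authors' method): the full image is resized to a coarse view while the region of interest is cropped from the original, full-resolution image, and both views would then be passed to the vision-language model. The ROI coordinates and output size are hypothetical.

```python
from PIL import Image

def focus_on_region(image_path, roi, out_size=(336, 336)):
    """Return (global_view, detail_view): the resized full image plus a crop of
    the region of interest taken from the original, full-resolution image."""
    image = Image.open(image_path).convert("RGB")
    global_view = image.resize(out_size)            # coarse overview for the LVLM
    left, top, right, bottom = roi                  # ROI in original pixel coordinates
    detail_view = image.crop((left, top, right, bottom)).resize(out_size)
    return global_view, detail_view

# Hypothetical usage: both views would be fed to a vision-language model, so the
# model can reason over fine detail without upscaling the whole image.
# global_img, detail_img = focus_on_region("photo.jpg", roi=(120, 80, 420, 360))
```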
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition [18.19998336526969]
ViLaS (Vision and Language into Automatic Speech Recognition) is a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism (a simplified sketch of CIF follows this entry).
To explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions.
arXiv Detail & Related papers (2023-05-31T16:01:20Z)
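The ViLaS entry above builds on the continuous integrate-and-fire (CIF) mechanism. The sketch below illustrates CIF's core idea in a simplified, single-utterance form: per-frame weights are accumulated and an integrated token embedding is "fired" whenever the accumulator reaches a threshold. Training-time weight scaling and tail handling are omitted.

```python
import torch

def cif(frames, alphas, threshold=1.0):
    """Simplified continuous integrate-and-fire.

    frames: (T, D) encoder outputs; alphas: (T,) non-negative weights.
    Accumulates weights frame by frame and emits ("fires") one integrated
    embedding every time the accumulator reaches `threshold`.
    """
    fired = []
    acc_weight = 0.0
    acc_state = torch.zeros(frames.size(1))
    for h, a in zip(frames, alphas):
        a = float(a)
        while acc_weight + a >= threshold:
            # Use just enough of this frame's weight to complete the current token.
            needed = threshold - acc_weight
            fired.append(acc_state + needed * h)
            a -= needed
            acc_weight = 0.0
            acc_state = torch.zeros_like(acc_state)
        acc_weight += a
        acc_state = acc_state + a * h
    return torch.stack(fired) if fired else torch.empty(0, frames.size(1))

# Toy example: 10 frames with weight 0.25 each (total 2.5) fire 2 token embeddings;
# the trailing half-token never completes and is discarded in this simplified version.
frames = torch.randn(10, 8)
alphas = torch.full((10,), 0.25)
tokens = cif(frames, alphas)
print(tokens.shape)  # torch.Size([2, 8])
```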
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- New Audio Representations Image Gan Generation from BriVL [0.0]
We propose a robust audio representation learning method, WavBriVL, based on Bridging-Vision-and-Language (BriVL).
WavBriVL projects audio, image, and text into a shared embedding space, so that multimodal applications can be realized.
arXiv Detail & Related papers (2023-03-08T13:58:55Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together (a toy sketch of such a unified loss follows this entry).
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
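The OmniVL entry above describes a unified vision-language contrastive (UniVLC) loss over image-text, video-text, image-label, and video-label data. The toy sketch below shows one way such sources could be folded into a single contrastive form, by treating the text embedding of a sample's class name as its positive; this is an assumption for illustration, not the published UniVLC loss.

```python
import torch
import torch.nn.functional as F

def nce(x, y, temperature=0.07):
    """Standard symmetric InfoNCE over paired embeddings (image/video vs. text)."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def label_nce(feats, labels, class_text_emb, temperature=0.07):
    """Fold classification-style data into the same contrastive form by using
    the text embedding of each sample's class name as its positive."""
    positives = class_text_emb[labels]         # (B, D) class-name embeddings
    return nce(feats, positives, temperature)

# Toy batch: paired image-text and video-text data plus labelled images.
img, img_txt = torch.randn(8, 256), torch.randn(8, 256)
vid, vid_txt = torch.randn(8, 256), torch.randn(8, 256)
cls_img = torch.randn(8, 256)
cls_names = torch.randn(10, 256)              # text embeddings of 10 class names
cls_labels = torch.randint(0, 10, (8,))

# One unified objective over all four data sources.
loss = nce(img, img_txt) + nce(vid, vid_txt) + label_nce(cls_img, cls_labels, cls_names)
print(f"toy unified contrastive loss: {loss.item():.3f}")
```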
- Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that uses finer-grained visual information from different parts of the image, obtained via automatic object proposals (a rough sketch of this idea follows this entry).
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
arXiv Detail & Related papers (2020-10-05T23:06:24Z)
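Finally, the fine-grained grounding entry above replaces a single global image vector with features from automatic object proposals. The sketch below illustrates the general idea under assumed dimensions: audio features attend over per-proposal region features via cross-attention. It is not the paper's model.

```python
import torch
import torch.nn as nn

class ProposalGroundingFusion(nn.Module):
    """Let audio features attend over per-proposal region features instead of a
    single global image vector (illustrative, not the paper's architecture)."""
    def __init__(self, audio_dim=256, region_dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(audio_dim, heads, kdim=region_dim,
                                          vdim=region_dim, batch_first=True)

    def forward(self, audio_seq, region_feats):
        # audio_seq: (B, T, audio_dim); region_feats: (B, R, region_dim),
        # one row per automatic object proposal.
        fused, _ = self.attn(audio_seq, region_feats, region_feats)
        return audio_seq + fused               # residual fusion

# Toy usage: 20 audio frames attend over 5 object-proposal features.
fusion = ProposalGroundingFusion()
audio = torch.randn(2, 20, 256)
regions = torch.randn(2, 5, 256)
print(fusion(audio, regions).shape)  # torch.Size([2, 20, 256])
```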
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.