NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep
Autoencoder
- URL: http://arxiv.org/abs/2001.11406v2
- Date: Tue, 4 Feb 2020 19:08:49 GMT
- Title: NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep
Autoencoder
- Authors: Helard Martinez, M. C. Farias, A. Hines
- Abstract summary: We propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd).
The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of models for quality prediction of both audio and video
signals is a fairly mature field. However, although several multimodal models have
been proposed, audio-visual quality prediction is still an emerging
area. In fact, despite the reasonable performance obtained by combination and
parametric metrics, currently there is no reliable pixel-based audio-visual
quality metric. The approach presented in this work is based on the assumption
that autoencoders, fed with descriptive audio and video features, might produce
a set of features that is able to describe the complex audio and video
interactions. Based on this hypothesis, we propose a No-Reference Audio-Visual
Quality Metric Based on a Deep Autoencoder (NAViDAd). The model's visual features
are natural scene statistics (NSS) and spatial-temporal measures of the video
component. Meanwhile, the audio features are obtained by computing the
spectrogram representation of the audio component. The model is formed by a
2-layer framework that includes a deep autoencoder layer and a classification
layer. These two layers are stacked and trained to build the deep neural
network model. The model is trained and tested using a large set of stimuli,
containing representative audio and video artifacts. The model performed well
when tested against the UnB-AV and the LiveNetflix-II databases. Results show
that this type of approach produces quality scores that are highly correlated
with subjective quality scores.
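The two-stage design described in the abstract can be illustrated with a short sketch: hand-crafted audio-visual features (NSS and spatial-temporal video descriptors plus spectrogram-based audio descriptors) are compressed by a deep autoencoder, and a classification layer is then stacked on the encoder and trained against subjective quality labels. The sketch below is a minimal PyTorch approximation under assumed feature dimensions, layer sizes, and a 5-level quality scale; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not the paper's exact configuration):
VIDEO_FEAT_DIM = 36    # e.g. NSS + spatial-temporal descriptors per stimulus
AUDIO_FEAT_DIM = 128   # e.g. pooled spectrogram bins per stimulus
INPUT_DIM = VIDEO_FEAT_DIM + AUDIO_FEAT_DIM
CODE_DIM = 32          # assumed autoencoder bottleneck size
NUM_CLASSES = 5        # assumed discrete quality levels


def spectrogram_features(waveform: torch.Tensor, n_fft: int = 254) -> torch.Tensor:
    """Assumed audio descriptor: mean log-magnitude spectrogram per frequency bin."""
    spec = torch.stft(waveform, n_fft=n_fft, return_complex=True).abs()
    return torch.log1p(spec).mean(dim=-1)  # shape: (n_fft // 2 + 1,) == AUDIO_FEAT_DIM


class DeepAutoencoder(nn.Module):
    """Stage 1: compresses the concatenated audio-visual feature vector."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(INPUT_DIM, 96), nn.ReLU(),
            nn.Linear(96, CODE_DIM), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(CODE_DIM, 96), nn.ReLU(),
            nn.Linear(96, INPUT_DIM),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code


class QualityClassifier(nn.Module):
    """Stage 2: classification layer stacked on top of the trained encoder."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(CODE_DIM, NUM_CLASSES)

    def forward(self, x):
        return self.head(self.encoder(x))


def pretrain_autoencoder(features: torch.Tensor, epochs: int = 50) -> DeepAutoencoder:
    """Unsupervised reconstruction of the audio-visual features."""
    model = DeepAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        recon, _ = model(features)
        loss = nn.functional.mse_loss(recon, features)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def train_classifier(encoder, features, labels, epochs: int = 50) -> QualityClassifier:
    """Supervised training of the stacked network on subjective quality labels."""
    model = QualityClassifier(encoder)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(model(features), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    # Random stand-in data; real inputs would be NSS/spatio-temporal video
    # descriptors concatenated with spectrogram-based audio descriptors.
    feats = torch.randn(256, INPUT_DIM)
    labels = torch.randint(0, NUM_CLASSES, (256,))
    ae = pretrain_autoencoder(feats)
    clf = train_classifier(ae.encoder, feats, labels)
    print(clf(feats[:4]).argmax(dim=1))  # predicted quality levels
```

In the paper's pipeline the autoencoder and classification layers are stacked and trained to form a single deep network; the two stages are shown here as separate training loops only for clarity.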
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE).
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Audiovisual Saliency Prediction in Uncategorized Video Sequences based on Audio-Video Correlation [0.0]
This work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features.
The proposed model was evaluated using different criteria against eye fixations data for a publicly available DIEM video dataset.
arXiv Detail & Related papers (2021-01-07T14:22:29Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
- How deep is your encoder: an analysis of features descriptors for an autoencoder-based audio-visual quality metric [2.191505742658975]
The No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd) deals with this problem from a machine learning perspective.
A basic implementation of NAViDAd was able to produce accurate predictions tested with a range of different audio-visual databases.
arXiv Detail & Related papers (2020-03-24T20:15:12Z)