Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning
- URL: http://arxiv.org/abs/2010.08737v2
- Date: Mon, 11 Jan 2021 12:33:05 GMT
- Title: Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning
- Authors: Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris
- Abstract summary: We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs.
We train our network following a triplet generation process and optimize the triplet loss function.
The proposed approach achieves very competitive results compared to three state-of-the-art methods.
- Score: 19.730467023817123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we address the problem of audio-based near-duplicate video
retrieval. We propose the Audio Similarity Learning (AuSiL) approach that
effectively captures temporal patterns of audio similarity between video pairs.
For the robust similarity calculation between two videos, we first extract
representative audio-based video descriptors by leveraging transfer learning
based on a Convolutional Neural Network (CNN) trained on a large-scale dataset
of audio events, and then we calculate the similarity matrix derived from the
pairwise similarity of these descriptors. The similarity matrix is subsequently
fed to a CNN that captures the temporal structures existing within its
content. We train our network following a triplet generation process and
optimizing the triplet loss function. To evaluate the effectiveness of the
proposed approach, we have manually annotated two publicly available video
datasets based on the audio duplication between their videos. The proposed
approach achieves very competitive results compared to three state-of-the-art
methods. Also, unlike the competing methods, it remains robust when retrieving
audio duplicates generated with speed transformations.
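The pipeline described in the abstract (per-video audio descriptors, a pairwise similarity matrix, a CNN that scores the matrix, triplet training) can be pictured with a minimal PyTorch sketch. Everything below is illustrative: the layer sizes, the Chamfer-style max-then-mean aggregation, and the margin are assumptions rather than the authors' configuration, and the descriptor extraction step (transfer learning from an audio-event CNN) is treated as a given input.

```python
# Minimal, illustrative sketch of an AuSiL-style pipeline (assumed details,
# not the authors' exact architecture or hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityCNN(nn.Module):
    """Scores a frame-level audio similarity matrix of shape (B, 1, T_q, T_c)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.out = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, sim_matrix):
        x = self.out(self.conv(sim_matrix))          # (B, 1, T_q', T_c')
        # Chamfer-style aggregation: best match per query segment, then average.
        return x.max(dim=3).values.mean(dim=(1, 2))  # (B,) video-level score

def pairwise_similarity(desc_q, desc_c):
    """Cosine-similarity matrix between two descriptor sequences (T x D)."""
    q = F.normalize(desc_q, dim=1)
    c = F.normalize(desc_c, dim=1)
    return (q @ c.t()).unsqueeze(0).unsqueeze(0)     # (1, 1, T_q, T_c)

model = SimilarityCNN()
margin = 1.0  # assumed margin value

def triplet_step(anchor, positive, negative):
    """One triplet-loss step: anchor/positive share audio content, negative does not."""
    s_pos = model(pairwise_similarity(anchor, positive))
    s_neg = model(pairwise_similarity(anchor, negative))
    return F.relu(s_neg - s_pos + margin).mean()
```

In the paper itself, the descriptors come from transfer learning on a CNN trained on a large-scale audio-event dataset, and triplets are generated from the annotated duplicate/non-duplicate video pairs.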
Related papers
- Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval [3.5570874721859016]
We propose a two-staged training procedure in which multiple retrieval models are first trained without estimated correspondences.
In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets.
We evaluate our method on the ClothoV2 and the AudioCaps benchmarks and show that it improves retrieval performance, even in a restricting self-distillation setting; a hedged sketch of the second stage follows this entry.
arXiv Detail & Related papers (2024-08-21T14:10:58Z)
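The second stage of the entry above is essentially self-distillation over the batch audio-caption similarity matrix. The snippet below is one plausible instantiation, using averaged first-stage ("teacher") similarities as soft targets for a KL term; the temperature, the averaging, and the loss form are assumptions for illustration, not details from the paper.

```python
# Hedged sketch: estimated audio-caption correspondences as soft training targets.
import torch
import torch.nn.functional as F

def soft_targets(teacher_sims, temperature=2.0):
    """Average the first-stage models' audio-caption similarity matrices (B x B)
    and convert them into row-wise probability targets."""
    avg = torch.stack(teacher_sims).mean(dim=0)
    return F.softmax(avg / temperature, dim=1)

def correspondence_loss(student_sim, targets, temperature=2.0):
    """KL divergence between the student's correspondence estimates and the
    correspondences estimated in the first stage."""
    log_probs = F.log_softmax(student_sim / temperature, dim=1)
    return F.kl_div(log_probs, targets, reduction="batchmean")
```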
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence modeling scheme that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- Semi-supervised 3D Video Information Retrieval with Deep Neural Network and Bi-directional Dynamic-time Warping Algorithm [14.39527406033429]
The proposed algorithm is designed to handle large video datasets and retrieve the videos most related to a given query video clip.
We split both the candidate and the query videos into sequences of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network.
We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method (see the sketch after this entry).
arXiv Detail & Related papers (2023-09-03T03:10:18Z)
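The retrieval step of the entry above can be illustrated with a plain dynamic time warping (DTW) alignment over clip embeddings. The sketch below covers only that core step: the autoencoder that produces the embeddings and the paper's bi-directional variant are not reproduced, and the cosine-distance choice and length normalization are assumptions.

```python
# Illustrative DTW-based similarity between clip-embedding sequences (assumed details).
import numpy as np

def dtw_similarity(query_emb, cand_emb):
    """DTW over cosine distances between clip-embedding sequences (T x D);
    returns a negated, length-normalized alignment cost (higher = more similar)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    dist = 1.0 - q @ c.T                              # (T_q, T_c) cosine distances
    Tq, Tc = dist.shape
    acc = np.full((Tq + 1, Tc + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Tc + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return -acc[Tq, Tc] / (Tq + Tc)

def rank_candidates(query_emb, candidate_embs):
    """Rank candidate videos by DTW similarity to the query clip sequence."""
    scores = [dtw_similarity(query_emb, c) for c in candidate_embs]
    return np.argsort(scores)[::-1]                   # indices, most similar first
```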
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- End-to-End Lip Synchronisation Based on Pattern Classification [15.851638021923875]
We propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream.
We demonstrate that the proposed approach outperforms previous work by a large margin on the LRS2 and LRS3 datasets (a simplified offset-scoring sketch follows this entry).
arXiv Detail & Related papers (2020-05-18T11:42:32Z)
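The synchronisation task in the last entry can be pictured as scoring a fixed range of candidate audio-video offsets and keeping the best one. The paper trains a network end-to-end to classify the offset; the sketch below only illustrates the scoring idea with a cosine-similarity sweep over pre-computed per-frame embeddings, so the inputs, the offset range, and the scoring rule are placeholders.

```python
# Simplified offset scoring over pre-computed per-frame embeddings (placeholder setup).
import numpy as np

def predict_offset(audio_emb, video_emb, max_offset=15):
    """audio_emb, video_emb: per-frame embeddings (T x D) at the same frame rate.
    Scores each candidate offset by the mean cosine similarity of the
    overlapping frames and returns the best-scoring offset in frames."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    offsets = list(range(-max_offset, max_offset + 1))
    scores = []
    for k in offsets:
        if k >= 0:
            overlap = (a[k:] * v[:len(v) - k]).sum(axis=1)
        else:
            overlap = (a[:len(a) + k] * v[-k:]).sum(axis=1)
        scores.append(overlap.mean())
    return offsets[int(np.argmax(scores))]
```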
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.