Temporal Bilinear Encoding Network of Audio-Visual Features at Low
Sampling Rates
- URL: http://arxiv.org/abs/2012.10283v1
- Date: Fri, 18 Dec 2020 14:59:34 GMT
- Title: Temporal Bilinear Encoding Network of Audio-Visual Features at Low
Sampling Rates
- Authors: Feiyan Hu, Eva Mohedano, Noel O'Connor and Kevin McGuinness
- Abstract summary: This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate.
We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long-range temporal information.
- Score: 7.1273332508471725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current deep learning-based video classification architectures are
typically trained end-to-end on large volumes of data and require extensive
computational resources. This paper aims to exploit audio-visual information in
video classification at a sampling rate of 1 frame per second. We propose
Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual
long-range temporal information using bilinear pooling, and demonstrate that
bilinear pooling outperforms average pooling along the temporal dimension for
videos sampled at low frame rates. We also embed the label hierarchy in TBEN to
further improve the robustness of the classifier. Experiments on the FGA240
fine-grained classification dataset using TBEN achieve a new state-of-the-art
(hit@1=47.95%). We also explore incorporating TBEN with multiple decoupled
modalities, such as visual semantic and motion features: experiments on UCF101
sampled at 1 FPS achieve accuracy close to the state of the art (hit@1=91.03%)
while requiring significantly fewer computational resources than competing
approaches for both training and prediction.
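To make the contrast between the two temporal encodings concrete, below is a
minimal sketch (not the authors' implementation) of average pooling versus
bilinear pooling over per-frame features sampled at 1 FPS. The feature
dimension, the signed-square-root and L2 normalisation steps, and all function
names are illustrative assumptions rather than details taken from the paper.

```python
# Sketch: first-order (average) vs. second-order (bilinear) temporal pooling
# of per-frame features, e.g. one visual or audio feature vector per second.
import torch
import torch.nn.functional as F

def temporal_average_pool(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, D) -> (D,) by averaging over the temporal dimension."""
    return feats.mean(dim=0)

def temporal_bilinear_pool(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, D) -> (D*D,) second-order temporal encoding.

    The outer product of each frame feature with itself is accumulated over
    time; signed square-root and L2 normalisation are then applied, a common
    practice for bilinear descriptors (assumed here, not specified in the
    abstract).
    """
    T, D = feats.shape
    bilinear = torch.einsum('td,te->de', feats, feats) / T   # (D, D)
    vec = bilinear.flatten()                                  # (D*D,)
    vec = torch.sign(vec) * torch.sqrt(vec.abs() + 1e-12)     # signed sqrt
    return F.normalize(vec, dim=0)                            # L2 normalise

# Hypothetical usage: 60 one-per-second frame features of dimension 128.
frames = torch.randn(60, 128)
avg_desc = temporal_average_pool(frames)    # (128,) descriptor
bil_desc = temporal_bilinear_pool(frames)   # (16384,) descriptor for a classifier
```

The bilinear descriptor keeps pairwise feature co-occurrences across the clip,
which is what makes it richer than a plain temporal average when only a few
frames per second are available, at the cost of a quadratically larger output.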
Related papers
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with purpose-designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS
Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level spatio-temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Capturing Temporal Information in a Single Frame: Channel Sampling
Strategies for Action Recognition [19.220288614585147]
We address the problem of capturing temporal information for video classification in 2D networks, without increasing computational cost.
We propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes.
Our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing.
arXiv Detail & Related papers (2022-01-25T15:24:37Z) - Dual-view Snapshot Compressive Imaging via Optical Flow Aided Recurrent
Neural Network [14.796204921975733]
Dual-view snapshot compressive imaging (SCI) aims to capture videos from two fields of view (FoVs) in a single snapshot.
It is challenging for existing model-based decoding algorithms to reconstruct each individual scene.
We propose an optical flow-aided recurrent neural network for dual video SCI systems, which provides high-quality decoding in seconds.
arXiv Detail & Related papers (2021-09-11T14:24:44Z) - End-to-end Neural Video Coding Using a Compound Spatiotemporal
Representation [33.54844063875569]
We propose a hybrid motion compensation (HMC) method that adaptively combines the predictions generated by two approaches.
Specifically, we generate a compound spatiotemporal representation (CSTR) through a recurrent information aggregation (RIA) module.
We further design a one-to-many decoder pipeline to generate multiple predictions from the CSTR, including vector-based resampling, adaptive kernel-based resampling, compensation mode selection maps and texture enhancements.
arXiv Detail & Related papers (2021-08-05T19:43:32Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual
Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - Reinforcement Learning with Latent Flow [78.74671595139613]
Flow of Latents for Reinforcement Learning (Flare) is a network architecture for RL that explicitly encodes temporal information through latent vector differences.
We show that Flare recovers optimal performance in state-based RL without explicit access to the state velocity.
We also show that Flare achieves state-of-the-art performance on pixel-based challenging continuous control tasks within the DeepMind control benchmark suite.
arXiv Detail & Related papers (2021-01-06T03:50:50Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.