Hierarchical Memory Decoding for Video Captioning
- URL: http://arxiv.org/abs/2002.11886v1
- Date: Thu, 27 Feb 2020 02:48:10 GMT
- Title: Hierarchical Memory Decoding for Video Captioning
- Authors: Aming Wu, Yahong Han
- Abstract summary: The memory network (MemNet) has the advantage of storing
long-term information, but it has not been well exploited for video captioning.
In this paper, we devise a novel memory decoder for video captioning.
- Score: 43.51506421744577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video captioning often employ a recurrent neural
network (RNN) as the decoder. However, an RNN is prone to diluting long-term
information. Recent works have demonstrated that the memory network (MemNet)
has the advantage of storing long-term information, yet as a decoder it has not
been well exploited for video captioning, partly because of the difficulty of
sequence decoding with a MemNet. Instead of the common practice of sequence
decoding with an RNN, in this paper we devise a novel memory decoder for video
captioning. Concretely, after obtaining a representation of each frame through
a pre-trained network, we first fuse the visual and lexical information. Then,
at each time step, we apply a multi-layer MemNet-based decoder: in each layer,
a memory set stores previous information and an attention mechanism selects the
information related to the current input. This decoder thus avoids the dilution
of long-term information, and the multi-layer architecture helps capture
dependencies between frames and word sequences. Experimental results show that,
even without an encoding network, our decoder obtains competitive performance
and outperforms an RNN decoder. Furthermore, our decoder has fewer parameters
than a one-layer RNN decoder.
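To make the mechanism concrete, here is a minimal PyTorch sketch of such a
decoder under our own assumptions: the names (`MemoryLayer`, `MemoryDecoder`,
`dim`, `memories`) are hypothetical, and the simple dot-product attention may
differ from the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    """One layer of a MemNet-style decoder: a memory set stores states from
    previous time steps, and attention selects the entries relevant to the
    current input. A minimal sketch, not the authors' implementation."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)    # project current input to a query
        self.out = nn.Linear(2 * dim, dim)  # fuse input with attended read-out

    def forward(self, x, memory):
        # x: (batch, dim) fused visual-lexical input at the current step
        # memory: (batch, T, dim) states stored from the T previous steps
        q = self.query(x).unsqueeze(1)                 # (batch, 1, dim)
        scores = (q * memory).sum(-1)                  # (batch, T) dot products
        attn = F.softmax(scores, dim=-1)               # attention over memory slots
        read = (attn.unsqueeze(-1) * memory).sum(1)    # (batch, dim) weighted read-out
        return self.out(torch.cat([x, read], dim=-1))  # (batch, dim)

class MemoryDecoder(nn.Module):
    """A stack of memory layers, each with its own memory set, standing in
    for the RNN that a conventional captioning decoder would use."""

    def __init__(self, dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([MemoryLayer(dim) for _ in range(num_layers)])

    def forward(self, x, memories):
        # memories: one (batch, T, dim) memory set per layer
        for layer, memory in zip(self.layers, memories):
            x = layer(x, memory)
        return x
```

A word-prediction head (e.g., a linear layer with a softmax over the
vocabulary) would sit on top, and after each step the layer inputs would be
appended to the corresponding memory sets, so later steps can attend back to
them directly instead of relying on a recurrent state.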
Related papers
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands video of arbitrary length using a constant number of video streaming tokens that are encoded, propagated, and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z) - PottsMGNet: A Mathematical Explanation of Encoder-Decoder Based Neural
Networks [7.668812831777923]
We study the encoder-decoder-based network architecture from the algorithmic perspective.
We use the two-phase Potts model for image segmentation as a running example in our explanations (a standard form of this energy is written out after this list).
We show that the resulting discrete PottsMGNet is equivalent to an encoder-decoder-based network.
arXiv Detail & Related papers (2023-07-18T07:48:48Z) - Training Invertible Neural Networks as Autoencoders [3.867363075280544]
We present methods to train Invertible Neural Networks (INNs) as (variational) autoencoders, which we call INN (variational) autoencoders.
Our experiments on MNIST, CIFAR and CelebA show that for low bottleneck sizes our INN autoencoders achieve results similar to classical autoencoders.
arXiv Detail & Related papers (2023-03-20T16:24:06Z) - An Image captioning algorithm based on the Hybrid Deep Learning
Technique (CNN+GRU) [0.0]
We present a CNN-GRU encoder-decoder framework for image captioning.
It takes semantic context into consideration as well as time complexity.
The suggested model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.
arXiv Detail & Related papers (2023-01-06T10:00:06Z) - Graph Neural Networks for Channel Decoding [71.15576353630667]
We showcase competitive decoding performance for various coding schemes, such as low-density parity-check (LDPC) and BCH codes.
The idea is to let a neural network (NN) learn a generalized message passing algorithm over a given graph (a minimal sketch of one such round follows this list).
We benchmark our proposed decoder against state-of-the-art in conventional channel decoding as well as against recent deep learning-based results.
arXiv Detail & Related papers (2022-07-29T15:29:18Z) - KRNet: Towards Efficient Knowledge Replay [50.315451023983805]
Knowledge replay techniques are widely used in tasks such as continual learning and continuous domain adaptation.
We propose a novel and efficient knowledge recording network (KRNet) which directly maps an arbitrary sample identity number to the corresponding datum (a toy illustration follows this list).
Our KRNet requires significantly less storage cost for the latent codes and can be trained without the encoder sub-network.
arXiv Detail & Related papers (2022-05-23T08:34:17Z) - Small Lesion Segmentation in Brain MRIs with Subpixel Embedding [105.1223735549524]
We present a method to segment MRI scans of the human brain into ischemic stroke lesions and normal tissue.
We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network.
arXiv Detail & Related papers (2021-09-18T00:21:17Z) - Dynamic Neural Representational Decoders for High-Resolution Semantic
Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z) - Audio Captioning Transformer [44.68751180694813]
Audio captioning aims to automatically generate a natural language description of an audio clip.
Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder.
We propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
arXiv Detail & Related papers (2021-07-21T00:31:50Z) - Analysis of Convolutional Decoder for Image Caption Generation [1.2183405753834562]
Convolutional Neural Networks have been proposed for Sequence Modelling tasks such as Image Caption Generation.
Unlike recurrent decoders, convolutional decoders for image captioning do not generally benefit from an increase in network depth.
We observe that convolutional decoders perform comparably with recurrent decoders only when trained on shorter sentences containing up to 15 words.
arXiv Detail & Related papers (2021-03-08T17:25:31Z)