Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning
- URL: http://arxiv.org/abs/2401.02052v1
- Date: Mon, 2 Oct 2023 02:32:26 GMT
- Title: Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning
- Authors: Sikiru Adewale, Tosin Ige, Bolanle Hafiz Matti
- Abstract summary: This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions.
The many-to-many mapping occurs via an input temporal sequence of video frames to an output sequence of words to form a caption sentence.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work demonstrates the implementation and use of an encoder-decoder model
to perform a many-to-many mapping of video data to text captions. The
many-to-many mapping occurs via an input temporal sequence of video frames to
an output sequence of words to form a caption sentence. Data preprocessing,
model construction, and model training are discussed. Caption correctness is
evaluated using 2-gram BLEU scores across the different splits of the dataset.
Specific examples of output captions were shown to demonstrate model generality
over the video temporal dimension. Predicted captions were shown to generalize
over video action, even in instances where the video scene changed
dramatically. Model architecture changes are discussed to improve sentence
grammar and correctness.
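The abstract describes a standard sequence-to-sequence setup: an encoder LSTM consumes a temporal sequence of frame features and a decoder LSTM, initialized with the encoder states, emits the caption one word per step. Below is a minimal sketch of that architecture, assuming a Keras/TensorFlow stack, pre-extracted 2048-dimensional CNN frame features, 80 frames per clip, a 1,500-word vocabulary, and a 10-token maximum caption length; these framework and dimension choices are illustrative assumptions, not values taken from the paper.

```python
# Minimal encoder-decoder LSTM sketch for many-to-many video-to-caption mapping.
# All sizes below are assumptions for illustration only.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

NUM_FRAMES, FEAT_DIM = 80, 2048        # input temporal sequence of frame features
MAX_CAPTION_LEN, VOCAB_SIZE = 10, 1500  # output word sequence

# Encoder: reads the frame-feature sequence and keeps only its final states.
encoder_inputs = Input(shape=(NUM_FRAMES, FEAT_DIM))
_, state_h, state_c = LSTM(512, return_state=True)(encoder_inputs)

# Decoder: initialized with the encoder states, teacher-forced with one-hot words,
# and trained to predict the next word of the caption at every time step.
decoder_inputs = Input(shape=(MAX_CAPTION_LEN, VOCAB_SIZE))
decoder_seq = LSTM(512, return_sequences=True)(decoder_inputs,
                                               initial_state=[state_h, state_c])
word_probs = Dense(VOCAB_SIZE, activation="softmax")(decoder_seq)

model = Model([encoder_inputs, decoder_inputs], word_probs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

At inference time the decoder is typically unrolled step by step (greedy or beam search) rather than teacher-forced as in the training sketch above. The 2-gram BLEU evaluation mentioned in the abstract can be reproduced with NLTK; the reference and candidate captions below are made-up examples:

```python
# 2-gram BLEU for a single predicted caption (corpus-level scoring is analogous).
from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "man", "is", "playing", "a", "guitar"]]
candidate = ["a", "man", "plays", "a", "guitar"]
bleu2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5))  # uni- and bi-grams only
print(f"BLEU-2: {bleu2:.3f}")
```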
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replacing entities, actions, and flipping event order.
Our model sets new state of the art zero-shot performance in temporally-extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning [93.6842670770983]
Vid2Seq is a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.
We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks.
arXiv Detail & Related papers (2023-02-27T19:53:49Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes a video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network (a toy sketch of this gate modulation appears after this list).
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
- Guidance Module Network for Video Captioning [19.84617164810336]
We find that the normalization of extracted video features can improve the final performance of video captioning.
In this paper, we present a novel architecture which introduces a guidance module to encourage the encoder-decoder model to generate words related to the past and future words in a caption.
arXiv Detail & Related papers (2020-12-20T14:02:28Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
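Of the related papers above, the exemplar-sentence work (SMCG) is closest in spirit to the LSTM decoder of the main paper, since it conditions the recurrence itself rather than only the inputs. As referenced in that entry, here is a toy sketch of conditional gate modulation, assuming PyTorch; the class name, dimensions, and the exact point at which the modulation is applied are my own illustrative choices, not code or equations from the SMCG paper.

```python
# Toy illustration (not SMCG's actual formulation): an LSTM cell whose gate
# pre-activations are element-wise scaled by a conditioning vector, e.g. an
# exemplar-sentence syntax embedding.
import torch
import torch.nn as nn

class ModulatedLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, cond_dim):
        super().__init__()
        self.x2h = nn.Linear(input_dim, 4 * hidden_dim)    # input-to-gates
        self.h2h = nn.Linear(hidden_dim, 4 * hidden_dim)   # hidden-to-gates
        self.cond2m = nn.Linear(cond_dim, 4 * hidden_dim)  # condition-to-modulation

    def forward(self, x, h, c, cond):
        # Standard LSTM gate pre-activations, rescaled by a condition-dependent
        # factor in (0, 1) before the gate nonlinearities are applied.
        gates = (self.x2h(x) + self.h2h(h)) * torch.sigmoid(self.cond2m(cond))
        i, f, g, o = gates.chunk(4, dim=-1)
        c_next = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_next = torch.sigmoid(o) * torch.tanh(c_next)
        return h_next, c_next

# One decoding step with made-up batch size and dimensions.
cell = ModulatedLSTMCell(input_dim=300, hidden_dim=512, cond_dim=128)
x, cond = torch.randn(4, 300), torch.randn(4, 128)
h, c = torch.zeros(4, 512), torch.zeros(4, 512)
h, c = cell(x, h, c, cond)
```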
This list is automatically generated from the titles and abstracts of the papers on this site.