Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)
- URL: http://arxiv.org/abs/2312.07418v3
- Date: Sun, 19 May 2024 15:39:15 GMT
- Title: Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)
- Authors: Kabita Parajuli, Shashidhar Ram Joshi
- Abstract summary: This work develops a novel encoder-decoder paradigm for Nepali video captioning to address this gap.
LSTM and GRU sequence-to-sequence models generate related textual descriptions from features extracted from video frames with CNNs.
The model's effectiveness for Devanagari-script video captioning is demonstrated by BLEU, METEOR, and ROUGE measures, which are used to assess its performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain. This work develops a novel encoder-decoder paradigm for Nepali video captioning to address this gap. The model uses LSTM and GRU sequence-to-sequence networks to produce related textual descriptions from features extracted from video frames with CNNs. A Nepali video captioning dataset is generated from the Microsoft Research Video Description Corpus (MSVD) using Google Translate followed by manual post-editing. The model's effectiveness for Devanagari-script video captioning is demonstrated by BLEU, METEOR, and ROUGE measures, which are used to assess its performance.
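The paradigm the abstract describes lends itself to a compact illustration. Below is a minimal PyTorch sketch of a CNN-feature encoder-decoder captioner, not the authors' exact architecture: the frame-feature dimension, hidden size, vocabulary size, and choice of pretrained CNN are all assumptions, and the same class accepts either LSTM or GRU cells, matching the two variants the abstract mentions.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab=8000, rnn=nn.LSTM):
        super().__init__()
        self.encoder = rnn(feat_dim, hidden, batch_first=True)  # over frames
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = rnn(hidden, hidden, batch_first=True)    # over tokens
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim), e.g. pooled CNN activations
        # captions:    (B, T_words) Devanagari token ids (teacher forcing)
        _, state = self.encoder(frame_feats)    # summarize the video
        dec_out, _ = self.decoder(self.embed(captions), state)
        return self.out(dec_out)                # (B, T_words, vocab) logits

# Swap rnn=nn.GRU for the GRU variant; the shapes below are placeholders.
model = VideoCaptioner()
logits = model(torch.randn(2, 40, 2048), torch.randint(0, 8000, (2, 12)))
```

Training such a model would minimize cross-entropy between the logits and the ground-truth Nepali tokens; BLEU, METEOR, and ROUGE on the decoded captions could then be computed with standard toolkits such as nltk and rouge-score.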
Related papers
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically plausible contrastive changes in video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets a new state of the art in zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning [0.0]
This work demonstrates the implementation and use of an encoder-decoder model that performs a many-to-many mapping of video data to text captions.
The mapping runs from an input temporal sequence of video frames to an output sequence of words forming a caption sentence.
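As an illustration of that many-to-many mapping, here is a hedged greedy-decoding sketch that turns a frame-feature sequence into a word sequence one token at a time. It reuses the VideoCaptioner sketch above; the BOS/EOS token ids and the length cap are assumptions.

```python
import torch

@torch.no_grad()
def greedy_caption(model, frame_feats, bos_id=1, eos_id=2, max_len=20):
    # frame_feats: (1, T_frames, feat_dim) for a single video
    tokens = [bos_id]
    for _ in range(max_len):
        inp = torch.tensor([tokens])        # caption prefix so far
        logits = model(frame_feats, inp)    # (1, len(tokens), vocab)
        nxt = int(logits[0, -1].argmax())   # most likely next word
        if nxt == eos_id:
            break
        tokens.append(nxt)
    return tokens[1:]                       # word ids, BOS dropped
```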
arXiv Detail & Related papers (2023-10-02T02:32:26Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
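The composition step described here is essentially prompt assembly. A minimal sketch follows; the frame captions, object list, and the single in-context example are hypothetical placeholders, not VidIL's actual templates.

```python
# Hypothetical frame-level outputs from image-language models.
frame_captions = ["a man opens a laptop", "the man types on the keyboard"]
objects = ["man", "laptop", "keyboard"]

# One hypothetical in-context example, then the query video's content.
prompt = (
    "Frames: a dog runs; the dog catches a ball\n"
    "Objects: dog, ball\n"
    "Video caption: a dog runs and catches a ball\n\n"
    "Frames: " + "; ".join(frame_captions) + "\n"
    "Objects: " + ", ".join(objects) + "\n"
    "Video caption:"
)
print(prompt)  # fed to a text language model, which completes the caption
```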
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes the video's semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
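To make the gate-modulation idea concrete, here is a schematic sketch, emphatically not the paper's exact SMCG formulation: a standard LSTM step whose input, forget, and output gates are scaled by factors predicted from an exemplar-sentence embedding. The dimensions and the linear parameterization are assumptions.

```python
import torch
import torch.nn as nn

class ModulatedLSTMCell(nn.Module):
    def __init__(self, inp_dim, hid_dim, sent_dim):
        super().__init__()
        self.x2g = nn.Linear(inp_dim, 4 * hid_dim)   # input -> 4 gates
        self.h2g = nn.Linear(hid_dim, 4 * hid_dim)   # hidden -> 4 gates
        self.s2g = nn.Linear(sent_dim, 3 * hid_dim)  # sentence -> i,f,o scales

    def forward(self, x, h, c, sent_emb):
        i, f, g, o = (self.x2g(x) + self.h2g(h)).chunk(4, dim=-1)
        mi, mf, mo = torch.sigmoid(self.s2g(sent_emb)).chunk(3, dim=-1)
        i = torch.sigmoid(i) * mi       # syntax-conditioned input gate
        f = torch.sigmoid(f) * mf       # syntax-conditioned forget gate
        o = torch.sigmoid(o) * mo       # syntax-conditioned output gate
        c = f * c + i * torch.tanh(g)   # standard LSTM cell update
        return o * torch.tanh(c), c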
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce the new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate a caption which not only semantically describes the video contents but also syntactically imitates a given exemplar sentence.
We demonstrate our model's capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Guidance Module Network for Video Captioning [19.84617164810336]
We find that the normalization of extracted video features can improve the final performance of video captioning.
In this paper, we present a novel architecture which introduces a guidance module to encourage the encoder-decoder model to generate words related to the past and future words in a caption.
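The normalization observation is easy to illustrate. A minimal sketch, assuming per-frame feature vectors from some extractor:

```python
import torch
import torch.nn.functional as F

feats = torch.randn(2, 40, 2048)         # (batch, frames, feature_dim)
feats = F.normalize(feats, p=2, dim=-1)  # unit L2 norm per frame vector
```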
arXiv Detail & Related papers (2020-12-20T14:02:28Z)
- Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning [92.17835753226333]
We propose a novel unsupervised video-to-video translation model.
Our model decomposes style and content using a specialized UV-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
arXiv Detail & Related papers (2020-04-14T13:44:30Z)