SACT: Self-Aware Multi-Space Feature Composition Transformer for
Multinomial Attention for Video Captioning
- URL: http://arxiv.org/abs/2006.14262v1
- Date: Thu, 25 Jun 2020 09:11:49 GMT
- Title: SACT: Self-Aware Multi-Space Feature Composition Transformer for
Multinomial Attention for Video Captioning
- Authors: Chiranjib Sur
- Abstract summary: As the feature length increases, it becomes increasingly important to include provisions for better capture of the pertinent contents.
In this work, we introduce the Self-Aware Composition Transformer (SACT), which is capable of generating Multinomial Attention (MultAtt).
We propose the Self-Aware Composition Transformer model for dense video captioning and apply the technique to two benchmark datasets, ActivityNet and YouCookII.
- Score: 9.89901717499058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning rests on two fundamental concepts: feature
detection and feature composition. While modern transformers are effective at
composing features, they do not address the fundamental problem of selecting
and understanding the contents. As the feature length increases, it becomes
increasingly important to include provisions for better capture of the
pertinent contents. In this work, we introduce the Self-Aware Composition
Transformer (SACT), which generates Multinomial Attention (MultAtt), a way of
producing distributions over various combinations of frames. The multi-head
attention transformer works on the principle of combining all possible
contents for attention, which suits natural language classification but has
limitations for video captioning. Video contents have repetitions and require
parsing of the important contents for better content composition. SACT
provides more selective attention and combines it across different attention
heads to better capture the usable contents for any application. To address
the problem of diversification and encourage selective utilization, we propose
the Self-Aware Composition Transformer model for dense video captioning and
apply the technique to two benchmark datasets, ActivityNet and YouCookII.
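The abstract does not spell out how MultAtt is computed, but the following is a minimal sketch of the general idea of per-head selective frame attention, under stated assumptions. The class name SelectiveFrameAttention, the sigmoid frame gate, and the renormalization step are illustrative choices of this sketch, not the paper's actual MultAtt formulation.

# Hypothetical sketch: multi-head attention in which each head first scores
# frames and down-weights those it considers redundant, so different heads
# can attend to different combinations of frames. Names and design choices
# are assumptions, not the paper's method.
import torch
import torch.nn as nn

class SelectiveFrameAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, n_heads)  # one relevance score per head per frame
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, d_model) frame features
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # (B, T, D) -> (B, H, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Per-head frame relevance in [0, 1]; each head can emphasize a
        # different subset of frames.
        frame_gate = torch.sigmoid(self.gate(x))               # (B, T, H)
        frame_gate = frame_gate.permute(0, 2, 1).unsqueeze(2)  # (B, H, 1, T)

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        attn = attn * frame_gate                                # suppress low-relevance frames
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)  # renormalize rows

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)

# Usage: 8 frames of 512-d features for a batch of 2 clips.
layer = SelectiveFrameAttention(d_model=512, n_heads=8)
feats = torch.randn(2, 8, 512)
print(layer(feats).shape)  # torch.Size([2, 8, 512])

The key difference from standard multi-head attention is that each head applies its own learned gate over frames before attending, so repeated or uninformative frames can be suppressed differently per head; how the actual model realizes this selection is only described at a high level in the abstract.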
Related papers
- EVC-MF: End-to-end Video Captioning Network with Multi-scale Features [13.85795110061781]
We propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning.
It efficiently utilizes multi-scale visual and textual features to generate video descriptions.
The results demonstrate that EVC-MF yields competitive performance compared with the state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T02:16:02Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Video Captioning with Aggregated Features Based on Dual Graphs and Gated
Fusion [6.096411752534632]
Video captioning models aim to translate the content of videos into accurate natural language.
Existing methods often fail in generating sufficient feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z) - Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using the Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Exploration of Visual Features and their weighted-additive fusion for
Video Captioning [0.7388859384645263]
Video captioning is a popular task that challenges models to describe events in videos using natural language.
In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context.
arXiv Detail & Related papers (2021-01-14T07:21:13Z) - Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.