Multi-Granularity Network with Modal Attention for Dense Affective
Understanding
- URL: http://arxiv.org/abs/2106.09964v1
- Date: Fri, 18 Jun 2021 07:37:06 GMT
- Title: Multi-Granularity Network with Modal Attention for Dense Affective
Understanding
- Authors: Baoming Yan, Lin Wang, Ke Gao, Bo Gao, Xiao Liu, Chao Ban, Jiang Yang,
Xiaobo Li
- Abstract summary: In the recent EEV challenge, a dense affective understanding task is proposed and requires frame-level affective prediction.
We propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for a better description of the target frame.
The proposed method achieves a correlation score of 0.02292 in the EEV challenge.
- Score: 11.076925361793556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video affective understanding, which aims to predict the expressions
evoked by video content, is desired for video creation and recommendation. In
the recent EEV challenge, a dense affective understanding task is proposed that
requires frame-level affective prediction. In this paper, we propose a
multi-granularity network with modal attention (MGN-MA), which employs
multi-granularity features for a better description of the target frame.
Specifically, the multi-granularity features can be divided into frame-level,
clip-level and video-level features, which correspond to visually salient
content, semantic context and video theme information, respectively. A modal
attention fusion module is then designed to fuse the multi-granularity features
and emphasize the more affect-relevant modalities. Finally, the fused feature is
fed into a Mixture of Experts (MoE) classifier to predict the expressions. With
further model-ensemble post-processing, the proposed method achieves a
correlation score of 0.02292 in the EEV challenge.
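Below is a minimal PyTorch sketch of the fusion-and-classification pipeline the abstract describes: multi-granularity features are weighted by a modal attention module and passed to a Mixture-of-Experts head. The feature dimensions, hidden size, number of experts, and class count are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of the MGN-MA pipeline: modal attention over frame-, clip- and
# video-level features, followed by a Mixture-of-Experts classifier.
# All dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class ModalAttentionFusion(nn.Module):
    """Weights frame-, clip- and video-level features by learned modal attention."""

    def __init__(self, dims, hidden=256):
        super().__init__()
        # Project each granularity to a common hidden size.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        # One attention logit per modality, conditioned on all projected features.
        self.attn = nn.Linear(hidden * len(dims), len(dims))

    def forward(self, feats):  # feats: list of (batch, dim_i) tensors
        projected = [torch.relu(p(f)) for p, f in zip(self.proj, feats)]
        weights = torch.softmax(self.attn(torch.cat(projected, dim=-1)), dim=-1)
        stacked = torch.stack(projected, dim=1)          # (batch, n_modal, hidden)
        return (weights.unsqueeze(-1) * stacked).sum(1)  # (batch, hidden)


class MoEClassifier(nn.Module):
    """Mixture-of-Experts head: a softmax gate blends per-expert predictions."""

    def __init__(self, hidden, n_classes, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, n_classes) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, x):
        gate = torch.softmax(self.gate(x), dim=-1)                 # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)  # (batch, n_experts, n_classes)
        return torch.sigmoid((gate.unsqueeze(-1) * expert_out).sum(1))


# Hypothetical per-frame inputs: frame-level (visually salient content),
# clip-level (semantic context) and video-level (theme) features.
frame_feat, clip_feat, video_feat = (torch.randn(8, d) for d in (2048, 1024, 512))
fusion = ModalAttentionFusion(dims=[2048, 1024, 512])
head = MoEClassifier(hidden=256, n_classes=15)  # e.g. the 15 expression classes of EEV
scores = head(fusion([frame_feat, clip_feat, video_feat]))  # (8, 15) per-frame scores
```

In the full system the fused feature would be computed for every target frame and the resulting scores further combined by model-ensemble post-processing, as the abstract notes.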
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with multi-modal contrastive loss.
Our approach is designed to capture the dependencies between modalities, resulting in more accurate and thus more pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Existing video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully capture the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos [19.711703590063976]
We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level.
Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
arXiv Detail & Related papers (2022-05-25T16:15:46Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)