Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
- URL: http://arxiv.org/abs/2404.12782v1
- Date: Fri, 19 Apr 2024 10:43:25 GMT
- Title: Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
- Authors: Fengyi Fu, Shancheng Fang, Weidong Chen, Zhendong Mao
- Abstract summary: We propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network to generate diverse video comments with multiple sentiments and multiple semantics.
Specifically, our sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance.
A batch attention module is also proposed to alleviate the problem of missing sentiment samples caused by data imbalance.
- Score: 30.96049241998733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic live video commenting has attracted increasing attention due to its significance in narration generation, topic explanation, etc. However, current methods do not account for the diverse sentiments of the generated comments. Sentimental factors are critical in interactive commenting, yet they have received little research attention so far. Thus, in this paper, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch attention module, to achieve diverse video commenting with multiple sentiments and multiple semantics. Specifically, our sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance, which is then fused with cross-modal features to generate live video comments. Furthermore, a batch attention module is proposed to alleviate the problem of missing sentiment samples caused by data imbalance, which is common in live videos as video popularity varies. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms state-of-the-art methods in terms of the quality and diversity of generated comments. Related code is available at https://github.com/fufy1024/So-TVAE.
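To make the two components described in the abstract more concrete, the sketch below is a minimal, hypothetical PyTorch rendering (not the authors' implementation; see the linked repository for that) of a sentiment-conditioned VAE encoder with random masking, plus a batch attention step that lets samples of under-represented sentiments borrow features from other samples in the batch. All class names, dimensions, and the mask probability are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SentimentDiversityEncoder(nn.Module):
    """Hypothetical sketch of a sentiment-conditioned VAE encoder with
    random masking; names and sizes are illustrative assumptions."""

    def __init__(self, d_model=512, n_sentiments=3, latent_dim=128, mask_prob=0.15):
        super().__init__()
        self.sentiment_emb = nn.Embedding(n_sentiments, d_model)
        self.mask_prob = mask_prob
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, comment_feats, sentiment_ids):
        # comment_feats: (batch, seq_len, d_model); sentiment_ids: (batch,)
        if self.training:
            # Randomly drop token features so the latent code is pushed
            # toward semantic diversity rather than one fixed phrasing.
            keep = (torch.rand(comment_feats.shape[:2],
                               device=comment_feats.device) > self.mask_prob)
            comment_feats = comment_feats * keep.unsqueeze(-1)
        # Pool over tokens and inject the sentiment embedding as guidance.
        pooled = comment_feats.mean(dim=1) + self.sentiment_emb(sentiment_ids)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterisation trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar


class BatchAttention(nn.Module):
    """Hypothetical sketch of batch-level attention: every sample attends
    over the other samples in the batch, so rare-sentiment samples can
    borrow information from related ones."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feats):
        # feats: (batch, d_model); treat the batch dimension as a sequence.
        x = feats.unsqueeze(0)              # (1, batch, d_model)
        mixed, _ = self.attn(x, x, x)
        return feats + mixed.squeeze(0)     # residual connection
```

In this reading, the random mask perturbs the comment features so the latent code captures diverse semantics rather than a single phrasing, while batch attention mixes information across the batch so that rare-sentiment samples are not trained in isolation.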
Related papers
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z) - eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos [7.011656298079659]
The prevalence of short videos (SVs) makes emotion recognition in SVs necessary.
Considering the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos.
We present an end-to-end baseline method, AV-CPNet, that employs a video transformer to better learn semantically relevant representations.
arXiv Detail & Related papers (2023-11-29T03:24:30Z) - Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
The Variational Stacked Local Attention Network (VSLAN) exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
arXiv Detail & Related papers (2022-01-04T05:14:34Z) - Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
Our method consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z) - Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z) - A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis [8.927538538637783]
This paper describes a Transformer-based joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment Analysis.
In addition to using the Transformer architecture, our approach relies on a modular co-attention mechanism and a glimpse layer to jointly encode one or more modalities.
arXiv Detail & Related papers (2020-06-29T11:51:46Z) - SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning [9.89901717499058]
As the feature length increases, it becomes increasingly important to capture the pertinent content effectively.
In this work, we introduce the Self-Aware Composition Transformer (SACT), which is capable of generating Multinomial Attention (MultAtt).
We propose the SACT model for dense video captioning and apply the technique to two benchmark datasets, ActivityNet and YouCookII.
arXiv Detail & Related papers (2020-06-25T09:11:49Z) - An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z) - Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.