Multimodal Matching Transformer for Live Commenting
- URL: http://arxiv.org/abs/2002.02649v1
- Date: Fri, 7 Feb 2020 07:19:15 GMT
- Title: Multimodal Matching Transformer for Live Commenting
- Authors: Chaoqun Duan, Lei Cui, Shuming Ma, Furu Wei, Conghui Zhu, and Tiejun Zhao
- Abstract summary: Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
- Score: 97.06576354830736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic live commenting aims to provide real-time comments on videos for
viewers. It encourages user engagement on online video sites, and is also a
good benchmark for video-to-text generation. Recent work on this task adopts
encoder-decoder models to generate comments. However, these methods do not
model the interaction between videos and comments explicitly, so they tend to
generate popular comments that are often irrelevant to the videos. In this
work, we aim to improve the relevance between live comments and videos by
modeling the cross-modal interactions among different modalities. To this end,
we propose a multimodal matching transformer to capture the relationships among
comments, vision, and audio. The proposed model is based on the transformer
framework and can iteratively learn the attention-aware representations for
each modality. We evaluate the model on a publicly available live commenting
dataset. Experiments show that the multimodal matching transformer model
outperforms the state-of-the-art methods.
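The abstract describes the model only at a high level (a transformer that iteratively learns attention-aware representations for the comment, vision, and audio modalities and matches comments to videos). The snippet below is a minimal sketch of that kind of cross-modal matching transformer, assuming PyTorch; the class names, feature dimensions, and pooling/scoring head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): a matching-style transformer that
# iteratively refines comment, vision, and audio representations with
# cross-modal attention and scores comment-video relevance.
import torch
import torch.nn as nn


class CrossModalLayer(nn.Module):
    """One refinement step: a modality attends to the other modalities' features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query, context, context)   # cross-modal attention
        query = self.norm1(query + attended)                # residual + norm
        return self.norm2(query + self.ffn(query))          # position-wise feed-forward


class MatchingTransformer(nn.Module):
    """Scores how well a candidate comment matches the video's vision/audio streams."""

    def __init__(self, dim: int = 256, heads: int = 4, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(CrossModalLayer(dim, heads) for _ in range(num_layers))
        self.score = nn.Linear(3 * dim, 1)

    def forward(self, comment: torch.Tensor, vision: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # comment/vision/audio: (batch, seq_len, dim) features from modality-specific encoders
        for layer in self.layers:  # iteratively learn attention-aware representations
            comment, vision, audio = (
                layer(comment, torch.cat([vision, audio], dim=1)),
                layer(vision, torch.cat([comment, audio], dim=1)),
                layer(audio, torch.cat([comment, vision], dim=1)),
            )
        pooled = torch.cat([comment.mean(1), vision.mean(1), audio.mean(1)], dim=-1)
        return self.score(pooled).squeeze(-1)                # higher score = better match


# Toy usage with random features standing in for real encoder outputs.
model = MatchingTransformer()
score = model(torch.randn(2, 20, 256), torch.randn(2, 16, 256), torch.randn(2, 30, 256))
print(score.shape)  # torch.Size([2])
```

At inference time, a matching model like this would rank a pool of candidate comments by score rather than decode text token by token, which is how the paper frames improving relevance over encoder-decoder generation.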
Related papers
- A Multimodal Transformer for Live Streaming Highlight Prediction [26.787089919015983]
Live streaming requires models to infer without future frames and process complex multimodal interactions.
We introduce a novel Modality Temporal Alignment Module to handle the temporal shift of cross-modal signals.
We propose a novel Border-aware Pairwise Loss to learn from a large-scale dataset, utilizing implicit user feedback as a weak supervision signal.
arXiv Detail & Related papers (2024-06-15T04:59:19Z)
- Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting [30.96049241998733]
We propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network to generate diverse video commenting with multiple sentiments and multiple semantics.
Specifically, our sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance.
A batch attention module is also proposed to alleviate the problem of missing sentiment samples caused by data imbalance.
arXiv Detail & Related papers (2024-04-19T10:43:25Z)
- LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts [8.070778830276275]
We create a large-scale audio-visual multimodal dialogue dataset to facilitate the development of live commenting technologies.
The data is collected from Twitch, with 11 different categories and 575 streamers for a total of 438 hours of video and 3.2 million comments.
We propose a novel multimodal generation model capable of generating live comments that align with the temporal and spatial events within the video.
arXiv Detail & Related papers (2023-10-01T02:35:58Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Extensive experiments show that the hierarchical multimodal transformer (HMT) surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces the number of parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.