Class-attention Video Transformer for Engagement Intensity Prediction
- URL: http://arxiv.org/abs/2208.07216v1
- Date: Fri, 12 Aug 2022 01:21:30 GMT
- Title: Class-attention Video Transformer for Engagement Intensity Prediction
- Authors: Xusheng Ai, Victor S. Sheng, Chunhua Li
- Abstract summary: CavT is a method to uniformly perform end-to-end learning on variant-length long videos and fixed-length short videos.
CavT achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, and the state-of-the-art MSE (0.0377) on the DAiSEE dataset.
- Score: 20.430266245901684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to deal with variant-length long videos, prior works extract
multi-modal features and fuse them to predict students' engagement intensity.
In this paper, we present a new end-to-end method Class Attention in Video
Transformer (CavT), which involves a single vector to process class embedding
and to uniformly perform end-to-end learning on variant-length long videos and
fixed-length short videos. Furthermore, to address the lack of sufficient
samples, we propose a binary-order representatives sampling method (BorS) to
add multiple video sequences of each video to augment the training set.
BorS+CavT not only achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP
dataset, but also obtains the state-of-the-art MSE (0.0377) on the DAiSEE
dataset. The code and models will be made publicly available at
https://github.com/mountainai/cavt.
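For readers unfamiliar with class attention, the sketch below shows a minimal CaiT-style class-attention read-out over per-frame video tokens, which is one plausible reading of "a single vector to process class embedding". All identifiers (ClassAttentionBlock, EngagementHead, dim=768, depth=2) are illustrative assumptions rather than the authors' implementation, which is available at the repository above.

```python
import torch
import torch.nn as nn

class ClassAttentionBlock(nn.Module):
    """One class-attention layer: a single learnable class token attends to the
    sequence of video tokens (CaiT-style read-out); the video tokens themselves
    are never updated here."""
    def __init__(self, dim: int = 768, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, cls_tok, tokens, key_padding_mask=None):
        # cls_tok: (B, 1, D); tokens: (B, N, D), padded to the longest clip.
        ctx = torch.cat([cls_tok, tokens], dim=1)
        if key_padding_mask is not None:
            # The class token itself is never masked out.
            keep = torch.zeros(key_padding_mask.size(0), 1,
                               dtype=torch.bool, device=key_padding_mask.device)
            key_padding_mask = torch.cat([keep, key_padding_mask], dim=1)
        q, kv = self.norm1(cls_tok), self.norm1(ctx)
        attn_out, _ = self.attn(q, kv, kv, key_padding_mask=key_padding_mask)
        cls_tok = cls_tok + attn_out
        return cls_tok + self.mlp(self.norm2(cls_tok))


class EngagementHead(nn.Module):
    """Hypothetical regression head: class-attention read-out followed by a
    linear layer mapping the class token to an engagement score in [0, 1]."""
    def __init__(self, dim: int = 768, depth: int = 2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([ClassAttentionBlock(dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.reg = nn.Linear(dim, 1)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (B, N, D) per-frame/per-patch features from any backbone.
        cls_tok = self.cls_token.expand(tokens.size(0), -1, -1)
        for blk in self.blocks:
            cls_tok = blk(cls_tok, tokens, key_padding_mask)
        return torch.sigmoid(self.reg(self.norm(cls_tok.squeeze(1))))


# Example: two clips padded to 300 frame tokens; the second has 180 real frames.
feats = torch.randn(2, 300, 768)
mask = torch.zeros(2, 300, dtype=torch.bool)
mask[1, 180:] = True
scores = EngagementHead()(feats, mask)   # -> (2, 1), values in [0, 1]
```

Because only the class token is updated, variable-length long videos and fixed-length short clips can share such a head through padding and masking alone; the BorS augmentation described above, which draws multiple representative sequences from each video, would then live in the data pipeline rather than in this module.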
Related papers
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z) - VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive
Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z) - Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly and end-to-end trained.
arXiv Detail & Related papers (2021-04-30T07:38:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.