Exploiting Temporal Coherence for Multi-modal Video Categorization
- URL: http://arxiv.org/abs/2002.03844v2
- Date: Sat, 6 Jun 2020 00:17:11 GMT
- Title: Exploiting Temporal Coherence for Multi-modal Video Categorization
- Authors: Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee
- Abstract summary: In this paper, we focus on the problem of video categorization by using a multimodal approach.
We have developed a novel temporal coherence-based regularization approach, which applies to different types of models.
We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence outperform strong state-of-the-art baseline models.
- Score: 24.61762520189921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal ML models can process data in multiple modalities (e.g., video,
images, audio, text) and are useful for video content analysis in a variety of
problems (e.g., object detection, scene understanding). In this paper, we focus
on the problem of video categorization by using a multimodal approach. We have
developed a novel temporal coherence-based regularization approach, which
applies to different types of models (e.g., RNN, NetVLAD, Transformer). We
demonstrate through experiments how our proposed multimodal video
categorization models with temporal coherence outperform strong
state-of-the-art baseline models.
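The abstract does not include code; as a rough illustration only, a temporal coherence regularizer of the kind described could be sketched in PyTorch as below. The squared-L2 penalty on adjacent-frame embeddings, the weight `lambda_tc`, and the `model.encode` / `model.classify` hooks are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(features: torch.Tensor) -> torch.Tensor:
    """Penalize change between embeddings of adjacent time steps.

    features: (batch, time, dim) per-frame representations.
    """
    diffs = features[:, 1:, :] - features[:, :-1, :]
    return diffs.pow(2).sum(dim=-1).mean()

def training_step(model, clip, labels, lambda_tc=0.1):
    # `model.encode` / `model.classify` are hypothetical hooks for the
    # sequence encoder (RNN, NetVLAD, Transformer, ...) and the
    # classification head; lambda_tc is an assumed weighting.
    features = model.encode(clip)          # (B, T, D) frame embeddings
    logits = model.classify(features)      # (B, num_classes)
    task_loss = F.cross_entropy(logits, labels)
    return task_loss + lambda_tc * temporal_coherence_loss(features)
```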
Related papers
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
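The fusion design itself is not spelled out in the abstract; purely as an illustrative sketch of fusing two modalities at several scales (the channel widths and the concat-then-project scheme are assumptions, not U3M's architecture):

```python
import torch
import torch.nn as nn

class MultiscaleFusion(nn.Module):
    """Sketch: fuse two modalities' feature pyramids scale by scale,
    then merge scales. Illustrative only, not the U3M design."""
    def __init__(self, channels=(64, 128, 256), out_dim=256):
        super().__init__()
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * c, out_dim, kernel_size=1) for c in channels]
        )

    def forward(self, feats_a, feats_b):
        # feats_a / feats_b: lists of (B, C_i, H_i, W_i) maps per scale.
        fused = [
            f(torch.cat([a, b], dim=1))
            for f, a, b in zip(self.fuse, feats_a, feats_b)
        ]
        # Upsample every scale to the finest resolution and sum,
        # combining local (fine) and global (coarse) features.
        target = fused[0].shape[-2:]
        up = [nn.functional.interpolate(x, size=target, mode="bilinear",
                                        align_corners=False) for x in fused]
        return torch.stack(up).sum(dim=0)
```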
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds [5.549794481031468]
Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research.
In this work, we consider a variational bound that can tightly approximate the data log-likelihood.
We develop more flexible aggregation schemes that generalize PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks.
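As intuition for the aggregation idea, a DeepSets-style permutation-invariant combiner of per-modality encodings might look like the sketch below; the dimensions and sum-pooling choice are assumptions, and the paper's variational bound is not reproduced here.

```python
import torch
import torch.nn as nn

class PermutationInvariantAggregator(nn.Module):
    """Sketch: combine per-modality encodings so the result does not
    depend on modality order (shared MLPs + sum pooling)."""
    def __init__(self, dim=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.rho = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 2 * dim))

    def forward(self, encodings):
        # encodings: list of (B, dim) tensors, one per observed modality.
        pooled = torch.stack([self.phi(e) for e in encodings]).sum(dim=0)
        # e.g. mean and log-variance of an assumed posterior q(z|x).
        return self.rho(pooled)
```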
arXiv Detail & Related papers (2023-09-01T10:32:21Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
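A toy sketch of such cascaded selection, with dot-product scoring and top-k selection as stand-ins for MIST's actual modules (all shapes and the scoring rule are assumptions):

```python
import torch

def cascaded_selection(seg_feats, region_feats, query, k_seg=4, k_reg=12):
    """Toy sketch of segment-then-region selection for long-form VideoQA.

    seg_feats:    (S, D) one feature per video segment
    region_feats: (S, R, D) region features inside each segment
    query:        (D,) question embedding
    """
    # 1) Keep the k_seg segments most similar to the question,
    #    instead of attending densely over the whole video.
    seg_scores = seg_feats @ query                    # (S,)
    top_seg = seg_scores.topk(k_seg).indices          # (k_seg,)

    # 2) Within those segments, keep the k_reg best regions overall.
    regions = region_feats[top_seg].reshape(-1, region_feats.size(-1))
    reg_scores = regions @ query
    top_reg = reg_scores.topk(k_reg).indices
    return regions[top_reg]        # (k_reg, D), fed to an attention module
```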
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled as an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition [13.875674649636874]
We propose a Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition.
FAMF aggregates face features and incorporates them with multi-modal information to identify persons in videos.
We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames.
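As a loose illustration of attention-based frame weighting (with NetVLAD pooling simplified to a weighted mean, an assumption rather than the FAMF design):

```python
import torch
import torch.nn as nn

class AttentiveFrameAggregation(nn.Module):
    """Sketch: learn per-frame attention so low-quality frames are
    down-weighted before aggregation."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame face features.
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, D)
```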
arXiv Detail & Related papers (2020-10-19T08:06:40Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to leverage the plentiful unlabeled, unpaired multimodal data.
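For intuition, a contrastive objective separating related (paired) from unrelated (other pairs in the batch) multimodal samples could be sketched InfoNCE-style; the batch-negatives scheme and temperature are assumptions, not the paper's exact framework.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Sketch: pull paired image/text embeddings together, push
    unpaired ones apart. Row i of each input is a related pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarities
    targets = torch.arange(img.size(0), device=logits.device)
    # Symmetric loss: match images to texts and texts to images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```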
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
- Cross-modal Learning for Multi-modal Video Categorization [24.61762520189921]
Multi-modal machine learning (ML) models can process data in multiple modalities.
In this paper, we focus on the problem of video categorization using a multi-modal ML technique.
We show how our proposed multi-modal video categorization models with cross-modal learning outperform strong state-of-the-art baseline models.
arXiv Detail & Related papers (2020-03-07T03:21:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.