Exploiting Temporal Coherence for Multi-modal Video Categorization
- URL: http://arxiv.org/abs/2002.03844v2
- Date: Sat, 6 Jun 2020 00:17:11 GMT
- Title: Exploiting Temporal Coherence for Multi-modal Video Categorization
- Authors: Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee
- Abstract summary: In this paper, we focus on the problem of video categorization by using a multimodal approach.
We have developed a novel temporal coherence-based regularization approach, which applies to different types of models.
We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence outperform strong state-of-the-art baseline models.
- Score: 24.61762520189921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal ML models can process data in multiple modalities (e.g., video,
images, audio, text) and are useful for video content analysis in a variety of
problems (e.g., object detection, scene understanding). In this paper, we focus
on the problem of video categorization by using a multimodal approach. We have
developed a novel temporal coherence-based regularization approach, which
applies to different types of models (e.g., RNN, NetVLAD, Transformer). We
demonstrate through experiments how our proposed multimodal video
categorization models with temporal coherence outperform strong
state-of-the-art baseline models.
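The abstract does not include code; as a rough illustration only, a temporal coherence regularizer of the kind described could be sketched in PyTorch as below. The squared-L2 penalty on adjacent-frame embeddings, the weight `lambda_tc`, and the `model.encode` / `model.classify` hooks are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_coherence_loss(features: torch.Tensor) -> torch.Tensor:
    """Penalize change between embeddings of adjacent time steps.

    features: (batch, time, dim) per-frame representations.
    """
    diffs = features[:, 1:, :] - features[:, :-1, :]
    return diffs.pow(2).sum(dim=-1).mean()

def training_step(model, clip, labels, lambda_tc=0.1):
    # `model.encode` / `model.classify` are hypothetical hooks for the
    # sequence encoder (RNN, NetVLAD, Transformer, ...) and the
    # classification head; lambda_tc is an assumed weighting.
    features = model.encode(clip)          # (B, T, D) frame embeddings
    logits = model.classify(features)      # (B, num_classes)
    task_loss = F.cross_entropy(logits, labels)
    return task_loss + lambda_tc * temporal_coherence_loss(features)
```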
Related papers
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
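The fusion design itself is not spelled out in the abstract; purely as an illustrative sketch of fusing two modalities at several scales (the channel widths and the concat-then-project scheme are assumptions, not U3M's architecture):

```python
import torch
import torch.nn as nn

class MultiscaleFusion(nn.Module):
    """Sketch: fuse two modalities' feature pyramids scale by scale,
    then merge scales. Illustrative only, not the U3M design."""
    def __init__(self, channels=(64, 128, 256), out_dim=256):
        super().__init__()
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * c, out_dim, kernel_size=1) for c in channels]
        )

    def forward(self, feats_a, feats_b):
        # feats_a / feats_b: lists of (B, C_i, H_i, W_i) maps per scale.
        fused = [
            f(torch.cat([a, b], dim=1))
            for f, a, b in zip(self.fuse, feats_a, feats_b)
        ]
        # Upsample every scale to the finest resolution and sum,
        # combining local (fine) and global (coarse) features.
        target = fused[0].shape[-2:]
        up = [nn.functional.interpolate(x, size=target, mode="bilinear",
                                        align_corners=False) for x in fused]
        return torch.stack(up).sum(dim=0)
```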
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds [5.549794481031468]
Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research.
In this work, we consider a variational bound that can tightly approximate the data log-likelihood.
We develop more flexible aggregation schemes that generalize PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks.
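As intuition for the aggregation idea, a DeepSets-style permutation-invariant combiner of per-modality encodings might look like the sketch below; the dimensions and sum-pooling choice are assumptions, and the paper's variational bound is not reproduced here.

```python
import torch
import torch.nn as nn

class PermutationInvariantAggregator(nn.Module):
    """Sketch: combine per-modality encodings so the result does not
    depend on modality order (shared MLPs + sum pooling)."""
    def __init__(self, dim=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.rho = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 2 * dim))

    def forward(self, encodings):
        # encodings: list of (B, dim) tensors, one per observed modality.
        pooled = torch.stack([self.phi(e) for e in encodings]).sum(dim=0)
        # e.g. mean and log-variance of an assumed posterior q(z|x).
        return self.rho(pooled)
```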
arXiv Detail & Related papers (2023-09-01T10:32:21Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
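A toy sketch of such cascaded selection, with dot-product scoring and top-k selection as stand-ins for MIST's actual modules (all shapes and the scoring rule are assumptions):

```python
import torch

def cascaded_selection(seg_feats, region_feats, query, k_seg=4, k_reg=12):
    """Toy sketch of segment-then-region selection for long-form VideoQA.

    seg_feats:    (S, D) one feature per video segment
    region_feats: (S, R, D) region features inside each segment
    query:        (D,) question embedding
    """
    # 1) Keep the k_seg segments most similar to the question,
    #    instead of attending densely over the whole video.
    seg_scores = seg_feats @ query                    # (S,)
    top_seg = seg_scores.topk(k_seg).indices          # (k_seg,)

    # 2) Within those segments, keep the k_reg best regions overall.
    regions = region_feats[top_seg].reshape(-1, region_feats.size(-1))
    reg_scores = regions @ query
    top_reg = reg_scores.topk(k_reg).indices
    return regions[top_reg]        # (k_reg, D), fed to an attention module
```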
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- VIDM: Video Implicit Diffusion Models [75.90225524502759]
Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse images.
We propose a video generation method based on diffusion models, where the effects of motion are modeled as an implicit condition.
We improve the quality of the generated videos by proposing multiple strategies such as sampling space truncation, robustness penalty, and positional group normalization.
arXiv Detail & Related papers (2022-12-01T02:58:46Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition [13.875674649636874]
We propose a Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition.
FAMF aggregates face features and incorporates them with multi-modal information to identify persons in videos.
We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames.
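As a loose illustration of attention-based frame weighting (with NetVLAD pooling simplified to a weighted mean, an assumption rather than the FAMF design):

```python
import torch
import torch.nn as nn

class AttentiveFrameAggregation(nn.Module):
    """Sketch: learn per-frame attention so low-quality frames are
    down-weighted before aggregation."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame face features.
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, D)
```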
arXiv Detail & Related papers (2020-10-19T08:06:40Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to leverage the plentiful unlabeled, unpaired multimodal data.
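For intuition, a contrastive objective separating related (paired) from unrelated (other pairs in the batch) multimodal samples could be sketched InfoNCE-style; the batch-negatives scheme and temperature are assumptions, not the paper's exact framework.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Sketch: pull paired image/text embeddings together, push
    unpaired ones apart. Row i of each input is a related pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarities
    targets = torch.arange(img.size(0), device=logits.device)
    # Symmetric loss: match images to texts and texts to images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```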
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
- Cross-modal Learning for Multi-modal Video Categorization [24.61762520189921]
Multi-modal machine learning (ML) models can process data in multiple modalities.
In this paper, we focus on the problem of video categorization using a multi-modal ML technique.
We show how our proposed multi-modal video categorization models with cross-modal learning outperform strong state-of-the-art baseline models.
arXiv Detail & Related papers (2020-03-07T03:21:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.