MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic
Video Segmentation
- URL: http://arxiv.org/abs/2308.11185v1
- Date: Tue, 22 Aug 2023 04:23:59 GMT
- Title: MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic
Video Segmentation
- Authors: Najmeh Sadoughi, Xinyu Li, Avijit Vajpayee, David Fan, Bing Shuai,
Hector Santos-Villalobos, Vimal Bhat, Rohith MV
- Abstract summary: We introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation.
The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding.
MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots.
- Score: 10.82074185158027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous research has studied the task of segmenting cinematic videos into
scenes and into narrative acts. However, these studies have overlooked the
essential task of multimodal alignment and fusion for effectively and
efficiently processing long-form videos (>60min). In this paper, we introduce
Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic
long-video segmentation. MEGA tackles the challenge by leveraging multiple
media modalities. The method coarsely aligns inputs of variable lengths and
different modalities with alignment positional encoding. To maintain temporal
synchronization while reducing computation, we further introduce an enhanced
bottleneck fusion layer which uses temporal alignment. Additionally, MEGA
employs a novel contrastive loss to synchronize and transfer labels across
modalities, enabling act segmentation from labeled synopsis sentences on video
shots. Our experimental results show that MEGA outperforms state-of-the-art
methods on the MovieNet dataset for scene segmentation (with an Average
Precision improvement of +1.19%) and on the TRIPOD dataset for act segmentation
(with a Total Agreement improvement of +5.51%).
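The abstract names three components (alignment positional encoding, a temporally aligned bottleneck fusion layer, and a cross-modal contrastive loss) but does not spell out their mechanics. As a rough illustration only, the sketch below shows one plausible reading of the alignment positional encoding: each modality's token stream, whatever its length, is stretched onto a shared temporal axis before a standard sinusoidal encoding is applied, so that a video-shot token and a synopsis-sentence token covering the same portion of the film receive similar positional codes. The function names, the shared axis, and the sinusoidal form are assumptions, not the paper's implementation.

```python
import math
import torch

def sinusoidal_encoding(positions, dim):
    # positions: 1-D float tensor of (possibly fractional) temporal positions
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    angles = positions[:, None] * freqs[None, :]                                   # (N, dim/2)
    enc = torch.zeros(positions.size(0), dim)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

def alignment_positional_encoding(num_tokens, dim, max_len=1000.0):
    # Hypothetical "alignment" PE: every stream, regardless of its token count,
    # is mapped onto the same [0, max_len) axis, so tokens from different
    # modalities that cover the same part of the film get similar codes.
    shared_positions = torch.linspace(0.0, max_len, steps=num_tokens + 1)[:-1]
    return sinusoidal_encoding(shared_positions, dim)

# Example: 1,200 video-shot tokens and 40 synopsis-sentence tokens are
# coarsely aligned on a common temporal axis before fusion.
pe_shots = alignment_positional_encoding(1200, dim=256)
pe_synopsis = alignment_positional_encoding(40, dim=256)
```

Similarly, the cross-modal contrastive loss is described only at a high level. A generic symmetric InfoNCE over temporally aligned video-shot and synopsis-sentence embeddings, sketched below, is a common stand-in for this kind of cross-modal synchronization; it is not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (N, D) embeddings where row i of each tensor is an
    # aligned (shot, synopsis-sentence) pair; all other rows act as negatives.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)   # positives on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In this reading, minimizing the loss pulls each shot embedding toward the synopsis sentence it is aligned with, which is what would allow act labels attached to synopsis sentences to be transferred onto video shots.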
Related papers
- MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling [19.004339956475498]
MAVIN is designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence.
We introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics.
Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions.
arXiv Detail & Related papers (2024-05-28T09:46:09Z)
- Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment [33.74853437611066]
Weakly-supervised action segmentation is the task of partitioning a long video into several action segments when training videos are accompanied only by transcripts.
Most existing methods must infer pseudo segmentations for training via serial alignment between all frames and the transcript.
We propose a novel Action-Transition-Aware Boundary Alignment framework to efficiently and effectively filter out noisy boundaries and detect transitions.
arXiv Detail & Related papers (2024-03-28T08:39:44Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation [15.296933526770967]
This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation.
Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model.
Our framework performs consistently in both the fully and timestamp supervised settings, outperforming or competing with the state of the art on several datasets.
arXiv Detail & Related papers (2022-09-01T17:46:02Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- An End-to-End Trainable Video Panoptic Segmentation Method using Transformers [0.11714813224840924]
We present an algorithm to tackle a video panoptic segmentation problem, a newly emerging area of research.
Our proposed video panoptic segmentation algorithm uses a transformer and can be trained end-to-end with multiple video frames as input.
The method achieved 57.81% on the KITTI-STEP dataset and 31.8% on the MOTChallenge-STEP dataset.
arXiv Detail & Related papers (2021-10-08T10:13:37Z)
- Few-Shot Action Recognition with Compromised Metric via Optimal Transport [31.834843714684343]
Few-shot action recognition remains immature despite extensive research on few-shot image classification.
One main obstacle to applying these algorithms in action recognition is the complex structure of videos.
We propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions.
arXiv Detail & Related papers (2021-04-08T12:42:05Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, to generate complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
- Learning Motion Flows for Semi-supervised Instrument Segmentation from Robotic Surgical Video [64.44583693846751]
We study the semi-supervised instrument segmentation from robotic surgical videos with sparse annotations.
By exploiting generated data pairs, our framework can recover and even enhance temporal consistency of training sequences.
Results show that our method outperforms the state-of-the-art semi-supervised methods by a large margin.
arXiv Detail & Related papers (2020-07-06T02:39:32Z)