Multi-modal Segment Assemblage Network for Ad Video Editing with
Importance-Coherence Reward
- URL: http://arxiv.org/abs/2209.12164v1
- Date: Sun, 25 Sep 2022 06:51:45 GMT
- Title: Multi-modal Segment Assemblage Network for Ad Video Editing with
Importance-Coherence Reward
- Authors: Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng
- Abstract summary: Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
The existing method performs well at the video segmentation stage but suffers from dependence on extra cumbersome models and poor performance at the segment assemblage stage.
We propose M-SAN, which performs segment assemblage efficiently and coherently, end-to-end.
- Score: 34.06878258459702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advertisement video editing aims to automatically edit advertising videos
into shorter videos while retaining coherent content and crucial information
conveyed by advertisers. It mainly contains two stages: video segmentation and
segment assemblage. The existing method performs well at the video segmentation
stage but suffers from two problems: dependence on extra cumbersome models and
poor performance at the segment assemblage stage. To address these problems, we
propose M-SAN (Multi-modal Segment Assemblage Network), which performs the
segment assemblage task efficiently and coherently, end-to-end. It utilizes
multi-modal representation extracted from the segments and follows the
Encoder-Decoder Ptr-Net framework with the Attention mechanism.
Importance-coherence reward is designed for training M-SAN. We experiment on
the Ads-1k dataset with 1000+ videos under rich ad scenarios collected from
advertisers. To evaluate the methods, we propose a unified metric,
Imp-Coh@Time, which comprehensively assesses the importance, coherence, and
duration of the outputs at the same time. Experimental results show that our
method achieves better performance than random selection and the previous
method on the metric. Ablation experiments further verify that multi-modal
representation and importance-coherence reward significantly improve the
performance. Ads-1k dataset is available at:
https://github.com/yunlong10/Ads-1k
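The abstract describes a reward that jointly scores the importance and coherence of the assembled output under a duration constraint. The sketch below is a hypothetical illustration of how such a combined reward might be computed; the function name, the equal weighting, the duration window, and the score dictionaries are all assumptions for illustration, not the paper's actual formulation.

```python
def imp_coh_reward(segments, importance, coherence, target_dur, tolerance=5.0):
    """Hypothetical sketch of an importance-coherence reward.

    segments:   list of (segment_id, duration) for the assembled output
    importance: dict mapping segment_id -> importance score in [0, 1]
    coherence:  dict mapping (id_a, id_b) -> coherence score in [0, 1]
                for adjacent segment pairs
    """
    # Mean importance of the selected segments.
    imp = sum(importance[sid] for sid, _ in segments) / len(segments)

    # Mean coherence over adjacent pairs (trivially 1.0 for a single segment).
    pairs = list(zip(segments, segments[1:]))
    coh = (sum(coherence[(a[0], b[0])] for a, b in pairs) / len(pairs)
           if pairs else 1.0)

    # Zero reward if the total duration misses the target window,
    # mirroring the duration term in the Imp-Coh@Time metric.
    total = sum(d for _, d in segments)
    if abs(total - target_dur) > tolerance:
        return 0.0

    # Equal weighting is an assumption; the paper may balance differently.
    return 0.5 * imp + 0.5 * coh
```

A policy trained with such a reward is pushed to pick segments that are individually informative while remaining coherent with their neighbors, rather than optimizing either objective alone.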
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [86.29839352757922]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and
Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the minimum run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Multi-modal Representation Learning for Video Advertisement Content
Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content like caption and speech.
We propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text.
arXiv Detail & Related papers (2021-09-04T09:08:29Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video
Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- MINI-Net: Multiple Instance Ranking Network for Video Highlight
Detection [71.02649475990889]
We propose casting weakly supervised video highlight detection for a given event as learning a multiple instance ranking network (MINI-Net).
MINI-Net learns to enforce a higher highlight score for a positive bag that contains highlight segments of a specific event than those for negative bags that are irrelevant.
arXiv Detail & Related papers (2020-07-20T01:56:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.