Multi-modal Segment Assemblage Network for Ad Video Editing with
Importance-Coherence Reward
- URL: http://arxiv.org/abs/2209.12164v1
- Date: Sun, 25 Sep 2022 06:51:45 GMT
- Title: Multi-modal Segment Assemblage Network for Ad Video Editing with
Importance-Coherence Reward
- Authors: Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng
- Abstract summary: Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
The existing method performs well at the video segmentation stage but suffers from dependencies on extra cumbersome models and poor performance at the segment assemblage stage.
We propose M-SAN, which performs the segment assemblage task efficiently and coherently, end-to-end.
- Score: 34.06878258459702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advertisement video editing aims to automatically edit advertising videos
into shorter videos while retaining coherent content and crucial information
conveyed by advertisers. It mainly contains two stages: video segmentation and
segment assemblage. The existing method performs well at the video segmentation
stage but suffers from dependencies on extra cumbersome models and poor
performance at the segment assemblage stage. To address these problems, we
propose M-SAN (Multi-modal Segment Assemblage Network), which performs the
segment assemblage task efficiently and coherently in an end-to-end manner. It utilizes
multi-modal representation extracted from the segments and follows the
Encoder-Decoder Ptr-Net framework with the Attention mechanism.
Importance-coherence reward is designed for training M-SAN. We experiment on
the Ads-1k dataset, which contains 1000+ videos covering rich ad scenarios collected from
advertisers. To evaluate the methods, we propose a unified metric,
Imp-Coh@Time, which comprehensively assesses the importance, coherence, and
duration of the outputs at the same time. Experimental results show that our
method achieves better performance than random selection and the previous
method on the metric. Ablation experiments further verify that multi-modal
representation and importance-coherence reward significantly improve the
performance. The Ads-1k dataset is available at:
https://github.com/yunlong10/Ads-1k
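The abstract does not spell out how the importance and coherence terms are combined into a single reward; the sketch below is only a plausible illustration, assuming per-segment importance scores, pairwise coherence scores between adjacent selected segments, and a trade-off weight alpha. All names and the weighting scheme are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple


def imp_coh_reward(
    selection: List[int],
    importance: Dict[int, float],
    coherence: Dict[Tuple[int, int], float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical importance-coherence reward for an assembled segment sequence.

    selection  : indices of the chosen segments, in output order
    importance : per-segment importance score (illustrative)
    coherence  : coherence score for adjacent segment pairs (illustrative)
    alpha      : trade-off weight between the two terms (assumed, not from the paper)
    """
    if not selection:
        return 0.0
    # Average importance of the selected segments.
    imp = sum(importance[i] for i in selection) / len(selection)
    # Average coherence over adjacent pairs in the assembled order.
    pairs = list(zip(selection, selection[1:]))
    coh = sum(coherence.get(p, 0.0) for p in pairs) / len(pairs) if pairs else 0.0
    return alpha * imp + (1.0 - alpha) * coh


# Toy usage: three segments selected out of five.
importance = {0: 0.9, 1: 0.2, 2: 0.7, 3: 0.4, 4: 0.8}
coherence = {(0, 2): 0.6, (2, 4): 0.8}
print(imp_coh_reward([0, 2, 4], importance, coherence))  # 0.75
```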
Related papers
- X-SAM: From Segment Anything to Any Segmentation [63.79182974315084]
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. We present X-SAM, a streamlined Multimodal Large Language Model framework that extends the segmentation paradigm from segment anything to any segmentation. We propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities.
arXiv Detail & Related papers (2025-08-06T17:19:10Z) - ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts [64.93416171745693]
Reasoning Video Object Segmentation is a challenging task that generates a mask sequence from an input video and an implicit, complex text query. Existing works probe the problem by fine-tuning Multimodal Large Language Models (MLLMs) for segmentation-based output, while still falling short in difficult cases on videos with temporally-sensitive queries. We propose ThinkVideo, a novel framework that leverages the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering [36.94781787191615]
We propose a simple yet effective approach with active moment discovering (AMDNet), which discovers video moments that are semantically consistent with their queries.
Experiments on two large-scale video datasets demonstrate the superiority and efficiency of our AMDNet.
arXiv Detail & Related papers (2025-04-15T07:00:18Z) - Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z) - Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z) - ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z) - Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z) - OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z) - BURST: A Benchmark for Unifying Object Recognition, Segmentation and
Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimal run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Multi-modal Representation Learning for Video Advertisement Content
Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content like caption and speech.
We propose a multi-modal encoder to learn multi-modal representation from video advertisements through interaction between video-audio and text.
arXiv Detail & Related papers (2021-09-04T09:08:29Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video
Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - MINI-Net: Multiple Instance Ranking Network for Video Highlight
Detection [71.02649475990889]
We propose to cast weakly supervised video highlight detection for a given specific event as learning a multiple instance ranking network (MINI-Net).
MINI-Net learns to enforce a higher highlight score for a positive bag that contains highlight segments of a specific event than those for negative bags that are irrelevant.
arXiv Detail & Related papers (2020-07-20T01:56:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.