Multi-modal Representation Learning for Video Advertisement Content
Structuring
- URL: http://arxiv.org/abs/2109.06637v1
- Date: Sat, 4 Sep 2021 09:08:29 GMT
- Title: Multi-modal Representation Learning for Video Advertisement Content
Structuring
- Authors: Daya Guo and Zhaoyang Zeng
- Abstract summary: Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content like caption and speech.
We propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text.
- Score: 10.45050088240847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video advertisement content structuring aims to segment a given video
advertisement and label each segment on various dimensions, such as
presentation form, scene, and style. Different from real-life videos, video
advertisements contain sufficient and useful multi-modal content like caption
and speech, which provides crucial video semantics and would enhance the
structuring process. In this paper, we propose a multi-modal encoder to learn
multi-modal representation from video advertisements by interacting between
video-audio and text. Based on multi-modal representation, we then apply
Boundary-Matching Network to generate temporal proposals. To make the proposals
more accurate, we refine generated proposals by scene-guided alignment and
re-ranking. Finally, we incorporate proposal located embeddings into the
introduced multi-modal encoder to capture temporal relationships between local
features of each proposal and global features of the whole video for
classification. Experimental results show that our method achieves
significantly improvement compared with several baselines and Rank 1 on the
task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand
Challenge. Ablation study further shows that leveraging multi-modal content
like caption and speech in video advertisements significantly improve the
performance.
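The proposal step described above can be sketched roughly in code. This is a minimal illustration of the Boundary-Matching idea, not the authors' implementation: per-frame start/end boundary probabilities are paired into candidate segments, each scored by the product of its boundary confidences, then re-ranked by score. The probability values and the threshold are hypothetical.

```python
# Minimal sketch of Boundary-Matching-style temporal proposal generation.
# Inputs are hypothetical per-frame probabilities that a segment boundary
# (start or end) occurs at that frame.

def generate_proposals(start_probs, end_probs, threshold=0.5, max_duration=None):
    """Pair every confident start with every later confident end and
    score each proposal by the product of its boundary confidences."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    proposals = []
    for s in starts:
        for e in ends:
            if e <= s:
                continue  # a segment must end after it starts
            if max_duration is not None and e - s > max_duration:
                continue  # optionally cap the segment length
            proposals.append((s, e, start_probs[s] * end_probs[e]))
    # Re-rank proposals by confidence, highest first.
    return sorted(proposals, key=lambda p: p[2], reverse=True)

start_probs = [0.9, 0.1, 0.2, 0.7, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.8, 0.1, 0.2, 0.6]
for s, e, score in generate_proposals(start_probs, end_probs):
    print(f"segment [{s}, {e}] score={score:.2f}")
```

In the paper's pipeline, the equivalents of these raw scores would additionally be refined by scene-guided alignment and re-ranking before classification.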
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- MM-AU: Towards Multimodal Understanding of Advertisement Videos [38.117243603403175]
We introduce a multimodal multilingual benchmark called MM-AU composed of over 8.4K videos (147 hours) curated from multiple web sources.
We explore multiple zero-shot reasoning baselines through the application of large language models on the ads transcripts.
arXiv Detail & Related papers (2023-08-27T09:11:46Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward [34.06878258459702]
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
Existing methods perform well at the video segmentation stage but depend on extra cumbersome models and perform poorly at the segment assemblage stage.
We propose M-SAN, which performs efficient and coherent segment assemblage end-to-end.
arXiv Detail & Related papers (2022-09-25T06:51:45Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains segmentation and summarization modules for both the video and the text.
It formulates a cross-domain alignment objective with optimal transport distance to generate a representative visual and textual summary.
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the ability of structured analysis of advertising video content.
Our solution achieved a score of 0.2470 measured in consideration of localization and prediction accuracy, ranking fourth in the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.