Multi-modal Representation Learning for Video Advertisement Content
Structuring
- URL: http://arxiv.org/abs/2109.06637v1
- Date: Sat, 4 Sep 2021 09:08:29 GMT
- Title: Multi-modal Representation Learning for Video Advertisement Content
Structuring
- Authors: Daya Guo and Zhaoyang Zeng
- Abstract summary: Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content like caption and speech.
We propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text.
- Score: 10.45050088240847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video advertisement content structuring aims to segment a given video
advertisement and label each segment on various dimensions, such as
presentation form, scene, and style. Different from real-life videos, video
advertisements contain sufficient and useful multi-modal content like caption
and speech, which provides crucial video semantics and would enhance the
structuring process. In this paper, we propose a multi-modal encoder to learn
multi-modal representation from video advertisements by interacting between
video-audio and text. Based on multi-modal representation, we then apply
Boundary-Matching Network to generate temporal proposals. To make the proposals
more accurate, we refine generated proposals by scene-guided alignment and
re-ranking. Finally, we incorporate proposal located embeddings into the
introduced multi-modal encoder to capture temporal relationships between local
features of each proposal and global features of the whole video for
classification. Experimental results show that our method achieves
significantly improvement compared with several baselines and Rank 1 on the
task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand
Challenge. Ablation study further shows that leveraging multi-modal content
like caption and speech in video advertisements significantly improve the
performance.
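The proposal step described above can be sketched roughly in code. This is a minimal illustration of the Boundary-Matching idea, not the authors' implementation: per-frame start/end boundary probabilities are paired into candidate segments, each scored by the product of its boundary confidences, then re-ranked by score. The probability values and the threshold are hypothetical.

```python
# Minimal sketch of Boundary-Matching-style temporal proposal generation.
# Inputs are hypothetical per-frame probabilities that a segment boundary
# (start or end) occurs at that frame.

def generate_proposals(start_probs, end_probs, threshold=0.5, max_duration=None):
    """Pair every confident start with every later confident end and
    score each proposal by the product of its boundary confidences."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    proposals = []
    for s in starts:
        for e in ends:
            if e <= s:
                continue  # a segment must end after it starts
            if max_duration is not None and e - s > max_duration:
                continue  # optionally cap the segment length
            proposals.append((s, e, start_probs[s] * end_probs[e]))
    # Re-rank proposals by confidence, highest first.
    return sorted(proposals, key=lambda p: p[2], reverse=True)

start_probs = [0.9, 0.1, 0.2, 0.7, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.8, 0.1, 0.2, 0.6]
for s, e, score in generate_proposals(start_probs, end_probs):
    print(f"segment [{s}, {e}] score={score:.2f}")
```

In the paper's pipeline, the equivalents of these raw scores would additionally be refined by scene-guided alignment and re-ranking before classification.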
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- MM-AU: Towards Multimodal Understanding of Advertisement Videos [38.117243603403175]
We introduce a multimodal multilingual benchmark called MM-AU composed of over 8.4K videos (147 hours) curated from multiple web sources.
We explore multiple zero-shot reasoning baselines through the application of large language models on the ads transcripts.
arXiv Detail & Related papers (2023-08-27T09:11:46Z)
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward [34.06878258459702]
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
Existing methods perform well at the video segmentation stage but depend on extra cumbersome models and perform poorly at the segment assemblage stage.
We propose M-SAN, which performs efficient and coherent segment assemblage end-to-end.
arXiv Detail & Related papers (2022-09-25T06:51:45Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains segmentation and summarization modules for both the video and the text.
It formulates a cross-domain alignment objective with optimal transport distance to generate a representative visual and textual summary.
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the ability of structured analysis of advertising video content.
Our solution achieved a score of 0.2470 measured in consideration of localization and prediction accuracy, ranking fourth in the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.