A Multimodal Framework for Video Ads Understanding
- URL: http://arxiv.org/abs/2108.12868v1
- Date: Sun, 29 Aug 2021 16:06:00 GMT
- Title: A Multimodal Framework for Video Ads Understanding
- Authors: Zejia Weng, Lingchen Meng, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: We develop a multimodal system to improve the structured analysis of advertising video content.
Our solution achieved a score of 0.2470, which jointly measures localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
- Score: 64.70769354696019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a growing trend in placing video advertisements on social platforms
for online marketing, which demands automatic approaches to understand the
contents of advertisements effectively. Taking the 2021 TAAC competition as an
opportunity, we developed a multimodal system to improve the structured
analysis of advertising video content. In our framework, we break
down the video structuring analysis problem into two tasks, i.e., scene
segmentation and multi-modal tagging. In scene segmentation, we build upon a
temporal convolution module for temporal modeling to predict whether adjacent
frames belong to the same scene. In multi-modal tagging, we first compute
clip-level visual features by aggregating frame-level features with
NeXt-SoftDBoF. The visual features are further complemented with textual
features that are derived using a global-local attention mechanism to extract
useful information from OCR (Optical Character Recognition) and ASR (Automatic
Speech Recognition) outputs. Our solution achieved a score of 0.2470, which
jointly measures localization and prediction accuracy, ranking fourth on the
2021 TAAC final leaderboard.
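The abstract only sketches the scene segmentation design, so the following is a minimal PyTorch sketch of the general idea: run a temporal convolution module over precomputed frame-level features and classify each adjacent frame pair as same-scene or boundary. All dimensions and the exact layer layout here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of temporal-convolution scene segmentation (not the authors' code).
import torch
import torch.nn as nn

class TemporalConvSegmenter(nn.Module):
    """Predicts whether each pair of adjacent frames belongs to the same scene."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        # Stacked 1D convolutions over the time axis aggregate local temporal context.
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Score each adjacent frame pair from its concatenated contextual features.
        self.pair_classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, feat_dim)
        x = self.temporal_conv(frame_feats.transpose(1, 2)).transpose(1, 2)  # (batch, T, hidden)
        pairs = torch.cat([x[:, :-1], x[:, 1:]], dim=-1)                     # (batch, T-1, 2*hidden)
        return torch.sigmoid(self.pair_classifier(pairs)).squeeze(-1)        # same-scene probability

feats = torch.randn(2, 120, 1024)            # 120 frames of hypothetical features
same_scene = TemporalConvSegmenter()(feats)  # (2, 119); low values suggest scene boundaries
```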
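For clip-level aggregation, the abstract names NeXt-SoftDBoF but gives no details. Below is a generic SoftDBoF-style (soft bag-of-features) pooling sketch in the spirit of the YouTube-8M aggregation literature; the cluster count and the L2 normalization are assumptions, not the paper's exact design.

```python
# Hedged SoftDBoF-style sketch: soft-assign frames to K learned "visual words"
# and pool the assignments into a clip-level histogram.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftDBoF(nn.Module):
    def __init__(self, feat_dim: int = 1024, num_clusters: int = 2048):
        super().__init__()
        self.assign = nn.Linear(feat_dim, num_clusters)  # frame-to-word logits

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, feat_dim)
        soft_assign = F.softmax(self.assign(frame_feats), dim=-1)  # (batch, T, K)
        hist = soft_assign.sum(dim=1)                              # pool over time
        return F.normalize(hist, p=2, dim=-1)                      # (batch, K) clip feature

clip_feat = SoftDBoF()(torch.randn(2, 120, 1024))  # clip-level visual feature
```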
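For the textual branch, one plausible reading of "global-local attention" over OCR/ASR tokens is to combine full-sequence (global) attention with a window-restricted (local) attention and fuse the two branches. The sketch below implements that reading; the window size, dimensions, and fusion layer are assumptions, not the paper's module.

```python
# Hedged sketch of one possible global-local attention over OCR/ASR token embeddings.
import torch
import torch.nn as nn

class GlobalLocalTextAttention(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8, window: int = 8):
        super().__init__()
        self.window = window
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, L, dim) embeddings of OCR/ASR tokens
        g, _ = self.global_attn(tokens, tokens, tokens)  # full-sequence context
        # Block attention beyond +/- window positions for the local branch.
        L = tokens.size(1)
        idx = torch.arange(L, device=tokens.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked
        l, _ = self.local_attn(tokens, tokens, tokens, attn_mask=mask)
        fused = self.fuse(torch.cat([g, l], dim=-1))
        return fused.mean(dim=1)  # pooled textual feature for multi-modal tagging

text_feat = GlobalLocalTextAttention()(torch.randn(2, 64, 768))
```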
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score from both local and global views of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Semi-supervised and Deep learning Frameworks for Video Classification and Key-frame Identification [1.2335698325757494]
We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
arXiv Detail & Related papers (2022-03-25T05:45:18Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens in a sequence, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Multi-modal Representation Learning for Video Advertisement Content Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain rich and useful multi-modal content, such as captions and speech.
We propose a multi-modal encoder that learns multi-modal representations from video advertisements through interaction between video-audio and text.
arXiv Detail & Related papers (2021-09-04T09:08:29Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information forward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.