Video Ads Content Structuring by Combining Scene Confidence Prediction
and Tagging
- URL: http://arxiv.org/abs/2108.09215v1
- Date: Fri, 20 Aug 2021 15:13:20 GMT
- Title: Video Ads Content Structuring by Combining Scene Confidence Prediction
and Tagging
- Authors: Tomoyuki Suzuki and Antonio Tejero-de-Pablos
- Abstract summary: We propose a two-stage method that first provides the boundaries of the scenes and then combines a confidence score for each segmented scene and the tag classes predicted for that scene.
Our combined method improves the previous baselines on the challenging "Tencent Advertisement Video" dataset.
- Score: 10.609715843964263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video ads segmentation and tagging is a challenging task due to two main
reasons: (1) the video scene structure is complex and (2) it includes multiple
modalities (e.g., visual, audio, text). While previous work focuses mostly on
activity videos (e.g. "cooking", "sports"), it is not clear how they can be
leveraged to tackle the task of video ads content structuring. In this paper,
we propose a two-stage method that first provides the boundaries of the scenes,
and then combines a confidence score for each segmented scene and the tag
classes predicted for that scene. We provide extensive experimental results on
the network architectures and modalities used for the proposed method. Our
combined method improves the previous baselines on the challenging "Tencent
Advertisement Video" dataset.
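The second stage described in the abstract can be illustrated with a minimal sketch. The combination rule below (multiplying a scene's confidence score by each tag's predicted probability and ranking the results) is an assumption for illustration only; the paper's actual scoring function, model names, and tag set may differ.

```python
# Hypothetical sketch of the two-stage method's second stage: for each
# segmented scene, combine the scene confidence score with the tag-class
# probabilities predicted for that scene. The product rule and all names
# here are illustrative assumptions, not the paper's exact formulation.

def score_scene_tags(scene_confidence, tag_probs, top_k=3):
    """Rank tag classes for one segmented scene.

    scene_confidence: float in [0, 1], how confident the model is that
        this segment is a true scene.
    tag_probs: dict mapping tag name -> predicted probability.
    Returns the top_k (tag, combined_score) pairs, highest first.
    """
    combined = {tag: scene_confidence * p for tag, p in tag_probs.items()}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example: a confidently segmented scene with three candidate tags.
ranked = score_scene_tags(0.9, {"food": 0.8, "travel": 0.3, "sports": 0.1})
print(ranked)
```

Under this simple product rule, a low scene confidence uniformly downweights all tags for that segment, so poorly segmented scenes contribute less to the final structuring.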
Related papers
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain
Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z) - Scene Summarization: Clustering Scene Videos into Spatially Diverse
Frames [24.614476456145255]
We propose summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
arXiv Detail & Related papers (2023-11-28T22:18:26Z) - Multi-modal Segment Assemblage Network for Ad Video Editing with
Importance-Coherence Reward [34.06878258459702]
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
Existing methods perform well at the video segmentation stage but depend on extra cumbersome models and perform poorly at the segment assemblage stage.
We propose M-SAN which can perform efficient and coherent segment assemblage task end-to-end.
arXiv Detail & Related papers (2022-09-25T06:51:45Z) - TL;DW? Summarizing Instructional Videos with Task Relevance &
Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - Joint Multimedia Event Extraction from Video and Article [51.159034070824056]
We propose the first approach to jointly extract events from video and text articles.
First, we propose the first self-supervised multimodal event coreference model.
Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
arXiv Detail & Related papers (2021-09-27T03:22:12Z) - Overview of Tencent Multi-modal Ads Video Understanding Challenge [1.6904374000330984]
Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos.
It asks the participants to accurately predict both the scene boundaries and the multi-label categories of each scene.
It will advance the foundation of ads video understanding and have a significant impact on many ads applications like video recommendation.
arXiv Detail & Related papers (2021-09-16T13:07:08Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - A Local-to-Global Approach to Multi-modal Movie Scene Segmentation [95.34033481442353]
We build a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies.
We propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods.
arXiv Detail & Related papers (2020-04-06T13:58:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.