Overview of Tencent Multi-modal Ads Video Understanding Challenge
- URL: http://arxiv.org/abs/2109.07951v1
- Date: Thu, 16 Sep 2021 13:07:08 GMT
- Title: Overview of Tencent Multi-modal Ads Video Understanding Challenge
- Authors: Zhenzhi Wang, Liyu Wu, Zhimin Li, Jiangfeng Xiong, Qinglin Lu
- Abstract summary: The Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos.
It asks the participants to accurately predict both the scene boundaries and the multi-label categories of each scene.
It will advance the foundation of ads video understanding and have a significant impact on many ads applications like video recommendation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Multi-modal Ads Video Understanding Challenge is the first grand challenge
aiming to comprehensively understand ads videos. Our challenge includes two
tasks: video structuring in the temporal dimension and multi-modal video
classification. It asks the participants to accurately predict both the scene
boundaries and the multi-label categories of each scene based on a fine-grained
and ads-related category hierarchy. Therefore, our task has four distinguishing
features from previous ones: ads domain, multi-modal information, temporal
segmentation, and multi-label classification. It will advance the foundation of
ads video understanding and have a significant impact on many ads applications
like video recommendation. This paper presents an overview of our challenge,
including the background of ads videos, an elaborate description of task and
dataset, evaluation protocol, and our proposed baseline. By ablating the key
components of our baseline, we would like to reveal the main challenges of this
task and provide useful guidance for future research of this area. In this
paper, we give an extended version of our challenge overview. The dataset will
be publicly available at https://algo.qq.com/.
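The two tasks above reduce to predicting, for each video, a set of temporally segmented scenes together with multi-label categories per scene. As a minimal sketch (the official evaluation protocol is defined by the challenge and is not reproduced here), one plausible way to represent such predictions and score them with a temporal-IoU-based scene matching looks like this:

```python
# Illustrative sketch only: scene representation and a hypothetical
# temporal-IoU matching metric; NOT the challenge's official protocol.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Scene:
    start: float                                   # scene start (seconds)
    end: float                                     # scene end (seconds)
    labels: Set[str] = field(default_factory=set)  # multi-label categories


def temporal_iou(a: Scene, b: Scene) -> float:
    """Intersection-over-union of two temporal intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def scene_f1(preds: List[Scene], gts: List[Scene], iou_thr: float = 0.5) -> float:
    """Greedy one-to-one matching: a prediction counts as a true positive if it
    overlaps an unmatched ground-truth scene (IoU >= iou_thr) and shares at
    least one category label with it."""
    matched, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= iou_thr and (p.labels & gts[best_j].labels):
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Example: one predicted scene against two ground-truth scenes.
gt = [Scene(0.0, 12.5, {"food", "promotion"}), Scene(12.5, 30.0, {"brand"})]
pred = [Scene(0.0, 13.0, {"food"})]
print(scene_f1(pred, gt))  # precision 1.0, recall 0.5 -> F1 ≈ 0.667
```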
Related papers
- Subject-Oriented Video Captioning [64.08594243670296]
We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the describing target via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z) - Multi-queue Momentum Contrast for Microvideo-Product Retrieval [57.527227171945796]
We formulate the microvideo-product retrieval task, the first attempt to explore retrieval between two kinds of multi-modal instances (microvideos and products).
A novel approach, the Multi-Queue Momentum Contrast (MQMC) network, is proposed for bidirectional retrieval.
A discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories.
arXiv Detail & Related papers (2022-12-22T03:47:14Z)
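A rough sketch of the multi-queue idea above, per-category memory queues whose negatives carry category-dependent importance, is given below. The queue size, weighting rule, and loss form are assumptions for illustration, not the paper's actual MQMC design:

```python
# Rough sketch of a category-aware multi-queue contrast, assuming hypothetical
# weights and an InfoNCE-style loss; not the paper's actual MQMC network.
from collections import defaultdict

import torch
import torch.nn.functional as F


class MultiQueue:
    """One memory queue of (detached) embeddings per product category."""

    def __init__(self, dim: int, max_len: int = 1024):
        self.dim, self.max_len = dim, max_len
        self.queues = defaultdict(lambda: torch.empty(0, dim))

    def enqueue(self, feats: torch.Tensor, category: str) -> None:
        q = torch.cat([self.queues[category], feats.detach()], dim=0)
        self.queues[category] = q[-self.max_len:]  # keep only the newest entries

    def negatives(self, anchor_category: str, w_same: float = 1.0, w_other: float = 0.5):
        feats, weights = [], []
        for cat, q in self.queues.items():
            if len(q) == 0:
                continue
            feats.append(q)
            w = w_same if cat == anchor_category else w_other
            weights.append(torch.full((len(q),), w))
        if not feats:
            return None, None
        return torch.cat(feats), torch.cat(weights)


def weighted_info_nce(query, positive, neg_feats, neg_weights, tau: float = 0.07):
    """InfoNCE where each negative's contribution is scaled by its queue weight
    (implemented by adding log-weights to the negative logits)."""
    query = F.normalize(query, dim=-1)
    pos_logit = (query * F.normalize(positive, dim=-1)).sum(-1, keepdim=True) / tau
    neg_logits = query @ F.normalize(neg_feats, dim=-1).t() / tau
    neg_logits = neg_logits + torch.log(neg_weights + 1e-8)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)
    target = torch.zeros(len(query), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, target)
```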
- Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation [12.104032818304745]
We construct the Tencent Ads Video (TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level.
TAVS describes videos from three independent perspectives, namely presentation form, place, and style, and contains rich multi-modal information such as video, audio, and text.
It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels.
arXiv Detail & Related papers (2022-12-09T07:26:20Z)
- Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward [34.06878258459702]
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
Existing methods perform well at the video segmentation stage but depend on extra cumbersome models and perform poorly at the segment assemblage stage.
We propose M-SAN, which performs efficient and coherent segment assemblage end-to-end.
arXiv Detail & Related papers (2022-09-25T06:51:45Z)
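For intuition, an importance-coherence style reward for an assembled segment sequence might combine a per-segment importance term with a pairwise coherence term over adjacent selections. The sketch below only illustrates that general shape; the scorers, normalization, and trade-off weight are assumptions rather than the reward actually defined by M-SAN:

```python
# Hypothetical shape of an importance-coherence reward for segment assemblage;
# the scorers, normalization, and trade-off weight are all assumptions.
from typing import Callable, Sequence


def assemblage_reward(
    selected: Sequence[int],                 # indices of chosen segments, in play order
    importance: Sequence[float],             # per-segment importance scores in [0, 1]
    coherence: Callable[[int, int], float],  # coherence of two adjacent segments, in [0, 1]
    alpha: float = 0.5,                      # importance vs. coherence trade-off
) -> float:
    if not selected:
        return 0.0
    imp = sum(importance[i] for i in selected) / len(selected)
    pairs = list(zip(selected, selected[1:]))
    coh = sum(coherence(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return alpha * imp + (1.0 - alpha) * coh
```

A scalar reward of this shape could then drive reward-based training of the assemblage module, which is what the title's "reward" suggests.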
- Multi-modal Representation Learning for Video Advertisement Content Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content like caption and speech.
We propose a multi-modal encoder that learns multi-modal representations from video advertisements through interaction between video-audio and text.
arXiv Detail & Related papers (2021-09-04T09:08:29Z)
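One generic way to realize the video-audio/text interaction described above is bidirectional cross-attention between the two token streams. The block below is a schematic of that pattern with hypothetical dimensions, not the paper's actual encoder:

```python
# Schematic cross-modal block over pre-extracted video-audio and text tokens;
# dimensions and layout are assumptions, not the paper's encoder.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.va_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_va = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_va = nn.LayerNorm(dim)
        self.norm_text = nn.LayerNorm(dim)

    def forward(self, va_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # Video-audio tokens attend to text, and text tokens attend back.
        va_out, _ = self.va_to_text(va_tokens, text_tokens, text_tokens)
        text_out, _ = self.text_to_va(text_tokens, va_tokens, va_tokens)
        return self.norm_va(va_tokens + va_out), self.norm_text(text_tokens + text_out)
```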
- Video Ads Content Structuring by Combining Scene Confidence Prediction and Tagging [10.609715843964263]
We propose a two-stage method that first predicts scene boundaries and then combines a confidence score for each segmented scene with the tag classes predicted for that scene.
Our combined method improves the previous baselines on the challenging "Tencent Advertisement Video" dataset.
arXiv Detail & Related papers (2021-08-20T15:13:20Z)
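As a simplified illustration of the second stage, per-scene tag probabilities can be fused with the scene's confidence before thresholding; the multiplicative rule below is an assumption, since the exact combination used by the authors is not spelled out here:

```python
# Simplified sketch of fusing a scene confidence score with per-scene tag
# probabilities; the multiplicative rule and threshold are assumptions.
from typing import Dict, List, Tuple


def combine_confidence_and_tags(
    scenes: List[Tuple[float, float, float, Dict[str, float]]],
    tag_threshold: float = 0.5,
) -> List[Tuple[float, float, List[str]]]:
    """Each input scene is (start, end, scene_confidence, {tag: probability});
    the output keeps, per scene, the tags whose fused score clears the threshold."""
    results = []
    for start, end, conf, tag_probs in scenes:
        fused = {tag: conf * p for tag, p in tag_probs.items()}
        kept = [t for t, s in sorted(fused.items(), key=lambda kv: -kv[1]) if s >= tag_threshold]
        results.append((start, end, kept))
    return results
```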
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been dedicated to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
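To make the task format concrete, a Violin-style example pairs a subtitle-aligned video premise with a hypothesis and a binary entailment label; the representation below is illustrative, with field names that are assumptions rather than the dataset's actual schema:

```python
# Illustrative representation of a video-and-language inference example;
# field names and the accuracy helper are assumptions, not the Violin schema.
from dataclasses import dataclass
from typing import List


@dataclass
class ViolinExample:
    video_id: str
    subtitles: List[str]   # subtitle lines aligned with the clip (the premise)
    hypothesis: str        # natural-language statement about the clip
    entailed: bool         # True if entailed, False if contradicted


def accuracy(predictions: List[bool], examples: List[ViolinExample]) -> float:
    """Fraction of examples whose predicted entailment matches the label."""
    if not examples:
        return 0.0
    correct = sum(p == e.entailed for p, e in zip(predictions, examples))
    return correct / len(examples)
```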