Overview of Tencent Multi-modal Ads Video Understanding Challenge
- URL: http://arxiv.org/abs/2109.07951v1
- Date: Thu, 16 Sep 2021 13:07:08 GMT
- Title: Overview of Tencent Multi-modal Ads Video Understanding Challenge
- Authors: Zhenzhi Wang, Liyu Wu, Zhimin Li, Jiangfeng Xiong, Qinglin Lu
- Abstract summary: The Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos.
It asks the participants to accurately predict both the scene boundaries and the multi-label categories of each scene.
It will advance the foundation of ads video understanding and have a significant impact on many ads applications like video recommendation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Multi-modal Ads Video Understanding Challenge is the first grand challenge
aiming to comprehensively understand ads videos. Our challenge includes two
tasks: video structuring in the temporal dimension and multi-modal video
classification. It asks the participants to accurately predict both the scene
boundaries and the multi-label categories of each scene based on a fine-grained
and ads-related category hierarchy. Therefore, our task has four distinguishing
features from previous ones: ads domain, multi-modal information, temporal
segmentation, and multi-label classification. It will advance the foundation of
ads video understanding and have a significant impact on many ads applications
like video recommendation. This paper presents an overview of our challenge,
including the background of ads videos, an elaborate description of task and
dataset, evaluation protocol, and our proposed baseline. By ablating the key
components of our baseline, we would like to reveal the main challenges of this
task and provide useful guidance for future research of this area. In this
paper, we give an extended version of our challenge overview. The dataset will
be publicly available at https://algo.qq.com/.
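The two tasks above reduce to predicting, for each video, a set of temporally segmented scenes together with multi-label categories per scene. As a minimal sketch (the official evaluation protocol is defined by the challenge and is not reproduced here), one plausible way to represent such predictions and score them with a temporal-IoU-based scene matching looks like this:

```python
# Illustrative sketch only: scene representation and a hypothetical
# temporal-IoU matching metric; NOT the challenge's official protocol.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Scene:
    start: float                                   # scene start (seconds)
    end: float                                     # scene end (seconds)
    labels: Set[str] = field(default_factory=set)  # multi-label categories


def temporal_iou(a: Scene, b: Scene) -> float:
    """Intersection-over-union of two temporal intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def scene_f1(preds: List[Scene], gts: List[Scene], iou_thr: float = 0.5) -> float:
    """Greedy one-to-one matching: a prediction counts as a true positive if it
    overlaps an unmatched ground-truth scene (IoU >= iou_thr) and shares at
    least one category label with it."""
    matched, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= iou_thr and (p.labels & gts[best_j].labels):
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Example: one predicted scene against two ground-truth scenes.
gt = [Scene(0.0, 12.5, {"food", "promotion"}), Scene(12.5, 30.0, {"brand"})]
pred = [Scene(0.0, 13.0, {"food"})]
print(scene_f1(pred, gt))  # precision 1.0, recall 0.5 -> F1 ≈ 0.667
```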
Related papers
- Subject-Oriented Video Captioning [64.08594243670296]
We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the describing target via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z) - Multi-queue Momentum Contrast for Microvideo-Product Retrieval [57.527227171945796]
We formulate the microvideo-product retrieval task, the first attempt to explore retrieval between two kinds of multi-modal instances (microvideos and products).
A novel approach, the Multi-Queue Momentum Contrast (MQMC) network, is proposed for bidirectional retrieval.
A discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories.
arXiv Detail & Related papers (2022-12-22T03:47:14Z)
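A rough sketch of the multi-queue idea above, per-category memory queues whose negatives carry category-dependent importance, is given below. The queue size, weighting rule, and loss form are assumptions for illustration, not the paper's actual MQMC design:

```python
# Rough sketch of a category-aware multi-queue contrast, assuming hypothetical
# weights and an InfoNCE-style loss; not the paper's actual MQMC network.
from collections import defaultdict

import torch
import torch.nn.functional as F


class MultiQueue:
    """One memory queue of (detached) embeddings per product category."""

    def __init__(self, dim: int, max_len: int = 1024):
        self.dim, self.max_len = dim, max_len
        self.queues = defaultdict(lambda: torch.empty(0, dim))

    def enqueue(self, feats: torch.Tensor, category: str) -> None:
        q = torch.cat([self.queues[category], feats.detach()], dim=0)
        self.queues[category] = q[-self.max_len:]  # keep only the newest entries

    def negatives(self, anchor_category: str, w_same: float = 1.0, w_other: float = 0.5):
        feats, weights = [], []
        for cat, q in self.queues.items():
            if len(q) == 0:
                continue
            feats.append(q)
            w = w_same if cat == anchor_category else w_other
            weights.append(torch.full((len(q),), w))
        if not feats:
            return None, None
        return torch.cat(feats), torch.cat(weights)


def weighted_info_nce(query, positive, neg_feats, neg_weights, tau: float = 0.07):
    """InfoNCE where each negative's contribution is scaled by its queue weight
    (implemented by adding log-weights to the negative logits)."""
    query = F.normalize(query, dim=-1)
    pos_logit = (query * F.normalize(positive, dim=-1)).sum(-1, keepdim=True) / tau
    neg_logits = query @ F.normalize(neg_feats, dim=-1).t() / tau
    neg_logits = neg_logits + torch.log(neg_weights + 1e-8)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)
    target = torch.zeros(len(query), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, target)
```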
- Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation [12.104032818304745]
We construct the Tencent Ads Video (TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level.
TAVS describes videos from three independent perspectives, namely presentation form, place, and style, and contains rich multi-modal information such as video, audio, and text.
It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels.
arXiv Detail & Related papers (2022-12-09T07:26:20Z)
- Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward [34.06878258459702]
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers.
Existing methods perform well at the video segmentation stage but depend on extra cumbersome models and perform poorly at the segment assemblage stage.
We propose M-SAN, which performs efficient and coherent segment assemblage end-to-end.
arXiv Detail & Related papers (2022-09-25T06:51:45Z)
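For intuition, an importance-coherence style reward for an assembled segment sequence might combine a per-segment importance term with a pairwise coherence term over adjacent selections. The sketch below only illustrates that general shape; the scorers, normalization, and trade-off weight are assumptions rather than the reward actually defined by M-SAN:

```python
# Hypothetical shape of an importance-coherence reward for segment assemblage;
# the scorers, normalization, and trade-off weight are all assumptions.
from typing import Callable, Sequence


def assemblage_reward(
    selected: Sequence[int],                 # indices of chosen segments, in play order
    importance: Sequence[float],             # per-segment importance scores in [0, 1]
    coherence: Callable[[int, int], float],  # coherence of two adjacent segments, in [0, 1]
    alpha: float = 0.5,                      # importance vs. coherence trade-off
) -> float:
    if not selected:
        return 0.0
    imp = sum(importance[i] for i in selected) / len(selected)
    pairs = list(zip(selected, selected[1:]))
    coh = sum(coherence(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return alpha * imp + (1.0 - alpha) * coh
```

A scalar reward of this shape could then drive reward-based training of the assemblage module, which is what the title's "reward" suggests.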
- Multi-modal Representation Learning for Video Advertisement Content Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain sufficient and useful multi-modal content like caption and speech.
We propose a multi-modal encoder that learns multi-modal representations from video advertisements through interaction between video-audio and text.
arXiv Detail & Related papers (2021-09-04T09:08:29Z)
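One generic way to realize the video-audio/text interaction described above is bidirectional cross-attention between the two token streams. The block below is a schematic of that pattern with hypothetical dimensions, not the paper's actual encoder:

```python
# Schematic cross-modal block over pre-extracted video-audio and text tokens;
# dimensions and layout are assumptions, not the paper's encoder.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.va_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_va = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_va = nn.LayerNorm(dim)
        self.norm_text = nn.LayerNorm(dim)

    def forward(self, va_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # Video-audio tokens attend to text, and text tokens attend back.
        va_out, _ = self.va_to_text(va_tokens, text_tokens, text_tokens)
        text_out, _ = self.text_to_va(text_tokens, va_tokens, va_tokens)
        return self.norm_va(va_tokens + va_out), self.norm_text(text_tokens + text_out)
```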
- Video Ads Content Structuring by Combining Scene Confidence Prediction and Tagging [10.609715843964263]
We propose a two-stage method that first predicts scene boundaries and then combines a confidence score for each segmented scene with the tag classes predicted for that scene.
Our combined method improves the previous baselines on the challenging "Tencent Advertisement Video" dataset.
arXiv Detail & Related papers (2021-08-20T15:13:20Z)
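As a simplified illustration of the second stage, per-scene tag probabilities can be fused with the scene's confidence before thresholding; the multiplicative rule below is an assumption, since the exact combination used by the authors is not spelled out here:

```python
# Simplified sketch of fusing a scene confidence score with per-scene tag
# probabilities; the multiplicative rule and threshold are assumptions.
from typing import Dict, List, Tuple


def combine_confidence_and_tags(
    scenes: List[Tuple[float, float, float, Dict[str, float]]],
    tag_threshold: float = 0.5,
) -> List[Tuple[float, float, List[str]]]:
    """Each input scene is (start, end, scene_confidence, {tag: probability});
    the output keeps, per scene, the tags whose fused score clears the threshold."""
    results = []
    for start, end, conf, tag_probs in scenes:
        fused = {tag: conf * p for tag, p in tag_probs.items()}
        kept = [t for t, s in sorted(fused.items(), key=lambda kv: -kv[1]) if s >= tag_threshold]
        results.append((start, end, kept))
    return results
```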
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been dedicated to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
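To make the task format concrete, a Violin-style example pairs a subtitle-aligned video premise with a hypothesis and a binary entailment label; the representation below is illustrative, with field names that are assumptions rather than the dataset's actual schema:

```python
# Illustrative representation of a video-and-language inference example;
# field names and the accuracy helper are assumptions, not the Violin schema.
from dataclasses import dataclass
from typing import List


@dataclass
class ViolinExample:
    video_id: str
    subtitles: List[str]   # subtitle lines aligned with the clip (the premise)
    hypothesis: str        # natural-language statement about the clip
    entailed: bool         # True if entailed, False if contradicted


def accuracy(predictions: List[bool], examples: List[ViolinExample]) -> float:
    """Fraction of examples whose predicted entailment matches the label."""
    if not examples:
        return 0.0
    correct = sum(p == e.entailed for p, e in zip(predictions, examples))
    return correct / len(examples)
```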