Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene
Segmentation
- URL: http://arxiv.org/abs/2212.04700v1
- Date: Fri, 9 Dec 2022 07:26:20 GMT
- Title: Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene
Segmentation
- Authors: Jie Jiang, Zhimin Li, Jiangfeng Xiong, Rongwei Quan, Qinglin Lu, Wei
Liu
- Abstract summary: We construct the Tencent `Ads Video Segmentation' (TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level.
TAVS describes videos from three independent perspectives, namely `presentation form', `place', and `style', and contains rich multi-modal information such as video, audio, and text.
It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels.
- Score: 12.104032818304745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal video segmentation and classification have been advanced greatly by
public benchmarks in recent years. However, such research still mainly focuses
on human actions, failing to describe videos in a holistic view. In addition,
previous research tends to pay much attention to visual information yet ignores
the multi-modal nature of videos. To fill this gap, we construct the Tencent
`Ads Video Segmentation' (TAVS) dataset in the ads domain to escalate
multi-modal video analysis to a new level. TAVS describes videos from three
independent perspectives as `presentation form', `place', and `style', and
contains rich multi-modal information such as video, audio, and text. TAVS is
organized hierarchically in semantic aspects for comprehensive temporal video
segmentation with three levels of categories for multi-label classification,
e.g., `place' - `working place' - `office'. Therefore, TAVS is distinguished
from previous temporal segmentation datasets due to its multi-modal
information, holistic view of categories, and hierarchical granularities. It
includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500
labels. Accompanied with TAVS, we also present a strong multi-modal video
segmentation baseline coupled with multi-label class prediction. Extensive
experiments are conducted to evaluate our proposed method as well as existing
representative methods to reveal key challenges of our dataset TAVS.
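The three-level hierarchical multi-label scheme described above (e.g., `place' - `working place' - `office') can be illustrated with a minimal sketch. The class names, field names, and schema below are illustrative assumptions, not the dataset's actual annotation format:

```python
# Hypothetical sketch of TAVS-style hierarchical, multi-label segment
# annotations. Field and class names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentLabel:
    """One label path in the three-level hierarchy."""
    level1: str  # top-level perspective, e.g. 'place'
    level2: str  # mid-level category, e.g. 'working place'
    level3: str  # fine-grained class, e.g. 'office'


@dataclass
class Segment:
    """A temporal segment; multi-label, so it may carry several paths."""
    start_sec: float
    end_sec: float
    labels: List[SegmentLabel] = field(default_factory=list)


def label_path(label: SegmentLabel) -> str:
    """Flatten a hierarchical label into the paper's path notation."""
    return ' - '.join([label.level1, label.level2, label.level3])


segment = Segment(
    start_sec=3.2,
    end_sec=8.7,
    labels=[
        SegmentLabel('place', 'working place', 'office'),
        # second perspective on the same segment (names assumed):
        SegmentLabel('presentation form', 'narration', 'voice-over'),
    ],
)

print(label_path(segment.labels[0]))  # place - working place - office
```

This structure reflects why the dataset supports multi-label classification per segment: each segment carries one label path per independent perspective.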
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored yet much cheaper setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Towards Open-Vocabulary Video Instance Segmentation [61.469232166803465]
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories.
We introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories.
To benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories.
arXiv Detail & Related papers (2023-04-04T11:25:23Z) - Unsupervised Audio-Visual Lecture Segmentation [31.29084124332193]
We introduce AVLectures, a dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects.
Our second contribution is video lecture segmentation, which splits lectures into bite-sized topics and shows promise in improving learner engagement.
We use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH.
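The temporally consistent 1-nearest-neighbor idea can be sketched roughly as follows. This is a simplified illustration of the general approach (linking frames by a time-weighted nearest-neighbor relation and taking the resulting groups as segments), not the authors' TW-FINCH implementation; the weighting scheme and parameter names are assumptions:

```python
# Simplified sketch of temporally weighted 1-NN segment grouping,
# inspired by (but not reproducing) TW-FINCH. Assumptions: feature
# distances are scaled up by temporal distance, and frames linked by
# the 1-NN relation are merged into one group via union-find.
import numpy as np


def temporally_weighted_1nn_segments(features: np.ndarray,
                                     time_weight: float = 1.0) -> np.ndarray:
    """Group frames into segments by temporally weighted 1-NN linking."""
    n = len(features)
    t = np.arange(n, dtype=float) / max(n - 1, 1)  # normalized timestamps
    # Pairwise feature distances, inflated for temporally distant pairs.
    feat_d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    time_d = np.abs(t[:, None] - t[None, :])
    dist = feat_d * (1.0 + time_weight * time_d)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)  # each frame's single nearest neighbor

    # Union-find: merge every frame with its nearest neighbor.
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in enumerate(nn):
        parent[find(i)] = find(int(j))

    labels = np.array([find(i) for i in range(n)])
    _, labels = np.unique(labels, return_inverse=True)  # consecutive ids
    return labels


# Toy example: two distinct feature regimes yield two groups.
feats = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10.0])
print(temporally_weighted_1nn_segments(feats))
```

In the full method, the grouping is applied recursively to produce a hierarchy of partitions, from which a segmentation at the desired granularity is read off.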
arXiv Detail & Related papers (2022-10-29T16:26:34Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep-learning-based approaches have been applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video
Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.