The Potential of Vision-Language Models for Content Moderation of
Children's Videos
- URL: http://arxiv.org/abs/2312.03936v1
- Date: Wed, 6 Dec 2023 22:29:16 GMT
- Title: The Potential of Vision-Language Models for Content Moderation of
Children's Videos
- Authors: Syed Hammad Ahmed, Shengnan Hu, Gita Sukthankar
- Abstract summary: This paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance.
It is important to include more context in content moderation prompts, particularly for cartoon videos.
- Score: 1.0589208420411014
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Natural language supervision has been shown to be effective for zero-shot
learning in many computer vision tasks, such as object detection and activity
recognition. However, generating informative prompts can be challenging for
more subtle tasks, such as video content moderation, where a video may be
inappropriate for many reasons beyond violence and obscenity. For example,
scammers may attempt to create junk content that mimics popular educational
videos but conveys no meaningful information. This paper evaluates the
performance of several CLIP variations for content moderation of children's
cartoons in both the supervised and zero-shot settings. We show that our
proposed model (Vanilla CLIP with Projection Layer) outperforms previous work
on the Malicious or Benign (MOB) benchmark for video content moderation. This
paper presents an in-depth analysis of how context-specific language prompts
affect content moderation performance. Our results indicate that it is
important to include more context in content moderation prompts, particularly
for cartoon videos, as they are not well represented in the CLIP training
data.
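To make the zero-shot prompting idea concrete, the following Python sketch (not the authors' released code; the checkpoint name, prompt wording, and projection-head sizes are illustrative assumptions) classifies a sampled cartoon frame as benign or malicious with CLIP, and shows how a small trainable projection head could sit on top of frozen CLIP features for the supervised setting.

```python
# Minimal sketch, assuming Hugging Face transformers CLIP; prompts and class
# names are illustrative, not the paper's exact wording.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Context-rich prompts: state that the content is a children's cartoon,
# since cartoons are under-represented in CLIP's training data.
prompts = [
    "a benign, age-appropriate scene from a children's cartoon video",
    "an inappropriate or disturbing scene in a children's cartoon video",
]

frame = Image.open("frame.jpg")  # a frame sampled from the video (placeholder path)
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # [P(benign), P(malicious)]
print(probs)

# Supervised variant: freeze CLIP and train a small projection head on the
# 512-d pooled image embedding (a sketch of the "projection layer" idea;
# the paper's exact architecture may differ).
class ProjectionHead(nn.Module):
    def __init__(self, in_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, image_embeds):
        return self.net(image_embeds)
```

In practice, frame-level predictions would be aggregated (e.g., averaged) over frames sampled from a clip to produce a video-level malicious/benign decision.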
Related papers
- Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion [0.6963971634605796]
We present an efficient adaptation of CLIP that can leverage contextual audio cues for enhanced content moderation.
We conduct experiments on a multimodal version of the MOB (Malicious or Benign) dataset in supervised and few-shot settings.
arXiv Detail & Related papers (2024-05-09T22:19:40Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Deep Architectures for Content Moderation and Movie Content Rating [3.04585143845864]
Movie content rating and TV show rating are the two most common rating systems established by professional committees.
A desirable solution is to use computer vision based video content analysis techniques to automate the evaluation process.
In this paper, related works are summarized for action recognition, multi-modal learning, movie genre classification, and sensitive content detection.
arXiv Detail & Related papers (2022-12-08T19:50:53Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) method.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures [6.25256391074865]
We propose a new unified model, CLUE, which learns from the features extracted from public online teaching videos.
Our model exploits various multi-modal features to model the complexity of language, contextual information, and the textual emotion of the delivered content.
arXiv Detail & Related papers (2022-01-14T19:51:06Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)