Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion
- URL: http://arxiv.org/abs/2405.06128v1
- Date: Thu, 9 May 2024 22:19:40 GMT
- Title: Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion
- Authors: Syed Hammad Ahmed, Muhammad Junaid Khan, Gita Sukthankar
- Abstract summary: We present an efficient adaptation of CLIP that can leverage contextual audio cues for enhanced content moderation.
We conduct experiments on a multimodal version of the MOB (Malicious or Benign) dataset in supervised and few-shot settings.
- Score: 0.6963971634605796
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children, yet such content is impossible to detect with a unimodal (visual-only) content moderation system. Popular video hosting platforms for children, such as YouTube Kids, still publish videos that contain audio content not conducive to a child's healthy behavioral and physical development. Robust classification of malicious videos therefore requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. To address this, we present an efficient adaptation of CLIP (Contrastive Language-Image Pre-training) that leverages contextual audio cues for enhanced content moderation. We incorporate 1) the audio modality and 2) prompt learning, while keeping the backbone modules of each modality frozen. We conduct our experiments on a multimodal version of the MOB (Malicious or Benign) dataset in supervised and few-shot settings.
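As a concrete illustration of this recipe, here is a minimal PyTorch-style sketch of an audiovisual classifier with frozen per-modality backbones, learnable prompt/context vectors, and a small trainable fusion head. The placeholder backbone modules, the additive prompt conditioning, and all hyperparameters below are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch of the recipe described above: frozen per-modality
# backbones, learnable prompt vectors, and a small trainable fusion classifier.
# The encoder arguments are placeholders for pretrained models (e.g. a CLIP ViT
# for frames and a pretrained audio spectrogram encoder for sound).
import torch
import torch.nn as nn


class AudiovisualModerator(nn.Module):
    def __init__(self, video_encoder: nn.Module, audio_encoder: nn.Module,
                 embed_dim: int = 512, n_prompt_tokens: int = 8, n_classes: int = 2):
        super().__init__()
        self.video_encoder = video_encoder  # frozen visual backbone (placeholder)
        self.audio_encoder = audio_encoder  # frozen audio backbone (placeholder)
        for p in self.video_encoder.parameters():
            p.requires_grad = False
        for p in self.audio_encoder.parameters():
            p.requires_grad = False

        # Learnable prompt/context vectors (prompt learning): the only
        # prompt-side parameters updated during training.
        self.prompt_tokens = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.02)

        # Lightweight trainable fusion of the two modality embeddings.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, n_classes),  # malicious vs. benign logits
        )

    def forward(self, frames: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():              # backbones stay frozen
            v = self.video_encoder(frames)  # (B, embed_dim) video feature
            a = self.audio_encoder(audio)   # (B, embed_dim) audio feature
        # Simple additive prompt conditioning; the paper's actual
        # prompt-learning scheme may differ.
        ctx = self.prompt_tokens.mean(dim=0)
        fused = torch.cat([v + ctx, a + ctx], dim=-1)
        return self.fusion(fused)           # logits over {benign, malicious}
```

Only the prompt vectors and the fusion head are trainable here, which is what makes this kind of adaptation usable in the few-shot setting the abstract mentions.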
Related papers
- MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation [43.35578187209748]
Foley audio faces significant challenges in the AI-generated content (AIGC) landscape.
Current text-to-audio technology relies on detailed and acoustically relevant textual descriptions.
We introduce the Multi-modal Image and Narrative Text Dubbing dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing.
arXiv Detail & Related papers (2024-06-15T10:47:36Z)
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video token reconstruction, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
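As a loose illustration of how such a staged, multi-objective schedule can be composed, the sketch below combines three placeholder loss terms with per-stage weights; the staging and weights are assumptions, not InternVideo2's published recipe.

```python
# Illustrative-only sketch of a progressive schedule over the three objectives
# named above; the stages and weights are assumptions, not InternVideo2's recipe.
import torch

# Per-stage weights for (masked reconstruction, cross-modal contrastive, next-token).
STAGE_WEIGHTS = {
    1: (1.0, 0.0, 0.0),  # stage 1: masked video token reconstruction only
    2: (0.5, 1.0, 0.0),  # stage 2: add cross-modal contrastive alignment
    3: (0.0, 1.0, 1.0),  # stage 3: add next-token prediction
}


def stage_loss(stage: int, recon: torch.Tensor, contrast: torch.Tensor,
               next_token: torch.Tensor) -> torch.Tensor:
    """Combine the three pre-training objectives with the weights active at `stage`."""
    w_r, w_c, w_n = STAGE_WEIGHTS[stage]
    return w_r * recon + w_c * contrast + w_n * next_token
```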
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- The Potential of Vision-Language Models for Content Moderation of Children's Videos [1.0589208420411014]
This paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance.
It is important to include more context in content moderation prompts, particularly for cartoon videos.
arXiv Detail & Related papers (2023-12-06T22:29:16Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
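The summary does not spell out the fusion design; below is a generic sketch of one way "local" (token-level cross-attention) and "global" (pooled clip-level) audio-visual exchange could be wired, purely for illustration and not the paper's actual mechanism.

```python
# Generic illustration of local-global audio-visual fusion (not the paper's
# exact design): tokens from each modality cross-attend to the other ("local"
# exchange), and pooled clip-level vectors are mixed ("global" exchange).
import torch
import torch.nn as nn


class LocalGlobalFusion(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.global_mix = nn.Linear(2 * dim, dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Local exchange: each modality's tokens attend to the other modality.
        v_local, _ = self.v_from_a(video_tokens, audio_tokens, audio_tokens)
        a_local, _ = self.a_from_v(audio_tokens, video_tokens, video_tokens)
        # Global exchange: pool each stream and mix the clip-level summaries.
        g = torch.cat([v_local.mean(dim=1), a_local.mean(dim=1)], dim=-1)
        return v_local, a_local, self.global_mix(g)
```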
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Malicious or Benign? Towards Effective Content Moderation for Children's Videos [1.0323063834827415]
This paper introduces our toolkit Malicious or Benign for promoting research on automated content moderation of children's videos.
We present 1) a customizable annotation tool for videos, 2) a new dataset with difficult-to-detect test cases of malicious content, and 3) a benchmark suite of state-of-the-art video classification models.
arXiv Detail & Related papers (2023-05-24T20:33:38Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- 'Beach' to 'Bitch': Inadvertent Unsafe Transcription of Kids' Content on YouTube [13.116806430326513]
Well-known automatic speech recognition (ASR) systems may produce text content highly inappropriate for kids while transcribing YouTube Kids' videos.
We release a first-of-its-kind dataset of audio clips for which existing state-of-the-art ASR systems hallucinate content that is inappropriate for kids.
arXiv Detail & Related papers (2022-02-17T19:19:09Z)
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
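As a hedged illustration of the idea (not the paper's actual code), the sketch below combines self-attention over each modality's own timeline with cross-attention to the other modality, then pools segment-level scores into a single video-level prediction so that only weak, video-level labels are needed for training.

```python
# Illustrative sketch of a hybrid attention block: each modality attends to its
# own temporal context (self-attention) and to the other modality
# (cross-attention); segment scores are pooled to a video-level prediction so
# that only weak, video-level labels are required.
import torch
import torch.nn as nn


class HybridAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 4, n_classes: int = 25):
        super().__init__()
        # Attention modules are shared across modalities here for brevity.
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def encode(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        uni, _ = self.self_attn(x, x, x)             # unimodal temporal context
        cross, _ = self.cross_attn(x, other, other)  # cross-modal temporal context
        return x + uni + cross

    def forward(self, audio_seq: torch.Tensor, video_seq: torch.Tensor):
        a = self.encode(audio_seq, video_seq)        # (B, T, dim)
        v = self.encode(video_seq, audio_seq)
        seg_logits = self.classifier(a + v)          # per-segment event logits
        video_logits = seg_logits.mean(dim=1)        # MIL-style pooling to video level
        return seg_logits, video_logits
```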
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Sounds produced outside of the camera's field of view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.