Is this Harmful? Learning to Predict Harmfulness Ratings from Video
- URL: http://arxiv.org/abs/2106.08323v1
- Date: Tue, 15 Jun 2021 17:57:12 GMT
- Title: Is this Harmful? Learning to Predict Harmfulness Ratings from Video
- Authors: Johan Edstedt, Johan Karlsson, Francisca Benavente, Anette Novak,
Amanda Berg, Michael Felsberg
- Abstract summary: We create a dataset of approximately 4000 video clips, annotated by professionals in the field.
We conduct an in-depth study on our modeling choices and find that we greatly benefit from combining the visual and audio modalities.
Our dataset will be made available upon publication.
- Score: 15.059547998989537
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically identifying harmful content in video is an important task with
a wide range of applications. However, due to the difficulty of collecting
high-quality labels as well as demanding computational requirements, the task
has not had a satisfying general approach. Typically, only small subsets of the
problem are considered, such as identifying violent content. In cases where the
general problem is tackled, rough approximations and simplifications are made
to deal with the lack of labels and computational complexity. In this work, we
identify and tackle the two main obstacles. First, we create a dataset of
approximately 4000 video clips, annotated by professionals in the field.
Second, we demonstrate that advances in video recognition enable training
models on our dataset that consider the full context of the scene. We conduct
an in-depth study of our modeling choices and find that we greatly benefit from
combining the visual and audio modalities, and that pretraining on large-scale
video recognition datasets and class-balanced sampling further improves
performance. We additionally perform a qualitative study that reveals the
heavily multi-modal nature of our dataset. Our dataset will be made available
upon publication.
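As a rough illustration of the two modeling choices highlighted in the abstract, the sketch below shows one way to fuse visual and audio features (late fusion by concatenation) and to draw training batches with class-balanced sampling in PyTorch. The stand-in encoders, feature dimensions, number of rating classes, and the fusion strategy are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch: audio-visual late fusion plus class-balanced sampling.
# Backbones, dimensions, and class count are placeholders, not the paper's setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

class AudioVisualClassifier(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=512, num_classes=4):
        super().__init__()
        # Stand-ins for pretrained encoders (e.g. a video network pretrained
        # on a large-scale video recognition dataset, and an audio network).
        self.video_encoder = nn.Linear(video_dim, 256)
        self.audio_encoder = nn.Linear(audio_dim, 256)
        self.head = nn.Linear(512, num_classes)  # classifier on fused features

    def forward(self, video_feats, audio_feats):
        v = torch.relu(self.video_encoder(video_feats))
        a = torch.relu(self.audio_encoder(audio_feats))
        # Late fusion: concatenate per-modality features, then classify.
        return self.head(torch.cat([v, a], dim=-1))

def class_balanced_loader(dataset, labels, batch_size=16):
    # Sample each clip with probability inversely proportional to its class
    # frequency, so rare rating classes are seen as often as common ones.
    labels = torch.as_tensor(labels)
    counts = torch.bincount(labels)
    weights = 1.0 / counts[labels].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```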
Related papers
- Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z) - NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy
Labels [33.659146748289444]
We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information.
We show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets.
arXiv Detail & Related papers (2021-10-13T16:12:18Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual
Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and show that self-supervised models trained on our automatically constructed data achieve downstream performance similar to that of models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z) - Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences
for Urban Scene Segmentation [57.68890534164427]
In this work, we ask whether we can leverage semi-supervised learning on unlabeled video sequences and extra images to improve performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data (see the pseudo-labeling sketch after this list).
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks.
arXiv Detail & Related papers (2020-05-20T18:00:05Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also combines dual-level attention (word/object and frame level), multi-head self- and cross-integration of the different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
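As referenced in the Naive-Student entry above, iterative pseudo-labeling can be summarized in a short self-training loop. The sketch below is a generic illustration, not that paper's actual pipeline; the `make_model` and `train_fn` callables, the data format (lists of (input, label) pairs), and the number of rounds are assumed placeholders.

```python
# Generic self-training loop: a teacher predicts labels for unlabeled data,
# and a fresh student is retrained on human plus pseudo labels each round.
import torch

def pseudo_label(teacher, unlabeled_loader, device="cpu"):
    # Run the current teacher over unlabeled batches and keep hard labels.
    teacher.eval()
    pairs = []
    with torch.no_grad():
        for x in unlabeled_loader:
            preds = teacher(x.to(device)).argmax(dim=1)
            pairs.extend(zip(x.cpu(), preds.cpu()))
    return pairs

def self_training(make_model, labeled, unlabeled_loader, train_fn, rounds=3):
    # Round 0: train only on the human-annotated (input, label) pairs.
    teacher = train_fn(make_model(), labeled)
    for _ in range(rounds):
        # Label the unlabeled data with the current teacher, then retrain a
        # fresh student on human-annotated plus pseudo-labeled examples.
        pseudo = pseudo_label(teacher, unlabeled_loader)
        teacher = train_fn(make_model(), labeled + pseudo)
    return teacher
```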