HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
- URL: http://arxiv.org/abs/2508.01712v1
- Date: Sun, 03 Aug 2025 10:46:06 GMT
- Title: HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
- Authors: Han Wang, Zhuoran Wang, Roy Ka-Wei Lee
- Abstract summary: HateClipSeg is a large-scale multimodal dataset with both video-level and segment-level annotations. Our three-stage annotation process yields high inter-annotator agreement. Results highlight substantial gaps in current models.
- Score: 8.323983138164547
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or as one of five Offensive categories (Hateful, Insulting, Sexual, Violence, Self-Harm), along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.
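For readers unfamiliar with the agreement statistic quoted above, the short sketch below shows how Krippendorff's alpha can be computed for nominal segment labels with the open-source `krippendorff` Python package; the annotator-by-segment matrix in it is invented for illustration and is not HateClipSeg data.

```python
# Minimal sketch: checking segment-level label agreement with Krippendorff's
# alpha. The rater-by-segment matrix below is invented for illustration and is
# NOT HateClipSeg data. Requires: pip install krippendorff numpy
import numpy as np
import krippendorff

# Nominal label codes; np.nan marks a segment an annotator did not rate.
LABELS = {"Normal": 0, "Hateful": 1, "Insulting": 2,
          "Sexual": 3, "Violence": 4, "Self-Harm": 5}

# Rows = annotators, columns = video segments (hypothetical ratings).
ratings = np.array([
    [0, 1, 1, 3, np.nan, 4],
    [0, 1, 2, 3, 5.0,    4],
    [0, 1, 1, 3, 5.0,    np.nan],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```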
Related papers
- Revealing Temporal Label Noise in Multimodal Hateful Video Classification [17.69786804367003]
We investigate the impact of label ambiguity through a fine-grained approach. We trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content.
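As a rough illustration of the trimming step described above, the sketch below cuts a video at annotated timestamps with ffmpeg; the file name, timestamps, and labels are placeholders, not values from HateMM or MultiHateClip.

```python
# Minimal sketch: cutting a source video into labeled segments at annotated
# timestamps. Assumes ffmpeg is installed and on PATH; the file name and
# timestamps are placeholders, not values from HateMM or MultiHateClip.
import subprocess

def trim_segment(src: str, start: float, end: float, dst: str) -> None:
    """Copy the [start, end] span of `src` into `dst` without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
         "-c", "copy", dst],
        check=True,
    )

# Hypothetical segment annotations: (start_sec, end_sec, label)
annotations = [(0.0, 12.5, "normal"), (12.5, 30.0, "hateful")]
for i, (start, end, label) in enumerate(annotations):
    trim_segment("video.mp4", start, end, f"video_seg{i}_{label}.mp4")
```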
arXiv Detail & Related papers (2025-08-06T21:55:59Z)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos, designed specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
- Simple Visual Artifact Detection in Sora-Generated Videos [9.991747596111011]
This study investigates visual artifacts frequently found and reported in Sora-generated videos. We propose a multi-label classification framework targeting four common artifact label types. The best-performing model, built on ResNet-50, achieved an average multi-label classification accuracy of 94.14%.
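The summary gives no implementation details, but a multi-label classifier of this kind is typically a CNN backbone with an independent sigmoid per label; the sketch below shows one such setup with a torchvision ResNet-50 and should be read as an assumption-laden illustration, not the authors' pipeline.

```python
# Minimal sketch of a 4-label multi-label frame classifier built on a
# torchvision ResNet-50 backbone, in the spirit of the framework described
# above (not the authors' code). Requires torch and torchvision.
import torch
import torch.nn as nn
from torchvision import models

NUM_ARTIFACT_TYPES = 4  # four common artifact label types (names unspecified here)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_ARTIFACT_TYPES)

criterion = nn.BCEWithLogitsLoss()  # one independent sigmoid per artifact label
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One dummy training step on random tensors, just to show the shapes involved.
frames = torch.randn(8, 3, 224, 224)                       # batch of video frames
targets = torch.randint(0, 2, (8, NUM_ARTIFACT_TYPES)).float()

logits = model(frames)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# At inference time each label is thresholded independently.
predictions = (torch.sigmoid(logits) > 0.5).int()
```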
arXiv Detail & Related papers (2025-04-30T05:41:43Z)
- Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions [3.9633773442108873]
We propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, referred to as the narration. NarVid exploits the narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, and 3) a dual-modal matching score obtained by adding query-video similarity and query-narration similarity.
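Point 3) amounts to summing two similarities; the sketch below illustrates that scoring rule with random embeddings standing in for NarVid's actual query, video, and narration encoders.

```python
# Minimal sketch of a dual-modal matching score: query-video similarity plus
# query-narration similarity. Random embeddings stand in for whatever encoders
# NarVid actually uses; only the scoring rule is illustrated.
import torch
import torch.nn.functional as F

num_videos, dim = 100, 512
query_emb = F.normalize(torch.randn(1, dim), dim=-1)                 # text query
video_embs = F.normalize(torch.randn(num_videos, dim), dim=-1)       # one embedding per video
narration_embs = F.normalize(torch.randn(num_videos, dim), dim=-1)   # frame-caption narration per video

sim_query_video = query_emb @ video_embs.T          # cosine similarities, shape (1, num_videos)
sim_query_narration = query_emb @ narration_embs.T  # cosine similarities, shape (1, num_videos)
score = sim_query_video + sim_query_narration       # dual-modal matching score

ranking = score.squeeze(0).argsort(descending=True)  # retrieval order, best first
```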
arXiv Detail & Related papers (2025-03-07T07:15:06Z)
- Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection [8.05088621131726]
Video-based hate speech detection remains under-explored, hindered by a lack of annotated datasets and the high cost of video annotation. We leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models. Our approach consistently outperforms state-of-the-art benchmarks.
arXiv Detail & Related papers (2025-01-26T07:50:14Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, which is over 10x longer than in most video-text datasets.
Vriptor, a model trained on Vript, is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos using cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
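To make the three annotation elements concrete, here is a hypothetical record in that shape; the field names and values are invented for illustration and are not the dataset's official schema.

```python
# Illustrative only: one annotation record with the three elements listed
# above. Field names and values are invented, not the dataset's actual schema.
example_annotation = {
    "query": "A chef shows how to fillet a fish",        # (1) free-form NL query
    "relevant_moments": [[30.0, 58.0], [90.0, 102.0]],    # (2) [start, end] in seconds
    "saliency_scores": [4, 5, 3],                         # (3) five-point scores, one per relevant clip
}

# Example use: pick the index of the most salient query-relevant clip.
best_clip = max(range(len(example_annotation["saliency_scores"])),
                key=lambda i: example_annotation["saliency_scores"][i])
```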
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification with negligible computational overhead.
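As a loose illustration of the idea, the sketch below pools features from several sampled clips into a shared video-level memory and conditions each clip's prediction on it; the toy encoder and mean pooling are stand-ins, not the paper's actual collaborative memory mechanism.

```python
# Minimal sketch of sharing a video-level "memory" across sampled clips. The
# toy encoder and mean-pooled memory are stand-ins for the paper's collaborative
# memory mechanism, not a reimplementation of it.
import torch
import torch.nn as nn

class ClipClassifierWithMemory(nn.Module):
    def __init__(self, feat_dim: int = 256, num_classes: int = 10):
        super().__init__()
        # Toy per-clip encoder: global average pool over (T, H, W), then project.
        self.encoder = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                     nn.Linear(3, feat_dim), nn.ReLU())
        # The classifier sees each clip feature concatenated with the shared memory.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (num_clips, C, T, H, W), all sampled from the same video
        feats = self.encoder(clips)                  # (num_clips, feat_dim)
        memory = feats.mean(dim=0, keepdim=True)     # pooled video-level memory
        memory = memory.expand_as(feats)             # shared with every clip
        return self.classifier(torch.cat([feats, memory], dim=-1))

model = ClipClassifierWithMemory()
clips = torch.randn(4, 3, 8, 112, 112)   # 4 clips sampled from one video
logits = model(clips)                    # (4, num_classes), memory-conditioned
```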
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.