MTikGuard System: A Transformer-Based Multimodal System for Child-Safe Content Moderation on TikTok
- URL: http://arxiv.org/abs/2511.17955v1
- Date: Sat, 22 Nov 2025 07:41:16 GMT
- Title: MTikGuard System: A Transformer-Based Multimodal System for Child-Safe Content Moderation on TikTok
- Authors: Dat Thanh Nguyen, Nguyen Hung Lam, Anh Hoang-Thi Nguyen, Trong-Hop Do
- Abstract summary: MTikGuard is a real-time multimodal harmful content detection system for TikTok. It uses visual, audio, and textual features to achieve state-of-the-art performance with 89.37% accuracy and 89.45% F1-score.
- Score: 2.679345223424902
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rapid rise of short-form videos, TikTok has become one of the most influential platforms among children and teenagers, but also a source of harmful content that can affect their perception and behavior. Such content, often subtle or deceptive, challenges traditional moderation methods due to the massive volume and real-time nature of uploads. This paper presents MTikGuard, a real-time multimodal harmful content detection system for TikTok, with three key contributions: (1) an extended TikHarm dataset expanded to 4,723 labeled videos by adding diverse real-world samples, (2) a multimodal classification framework integrating visual, audio, and textual features to achieve state-of-the-art performance with 89.37% accuracy and 89.45% F1-score, and (3) a scalable streaming architecture built on Apache Kafka and Apache Spark for real-time deployment. The results demonstrate the effectiveness of combining dataset expansion, advanced multimodal fusion, and robust deployment for practical large-scale social media content moderation. The dataset is available at https://github.com/ntdat-8324/MTikGuard-System.git.
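The abstract names two implementable pieces: a multimodal fusion classifier and a Kafka/Spark streaming deployment. Below is a minimal late-fusion sketch in PyTorch; the embedding dimensions, the projection-plus-concatenation fusion, and the four-way label space are illustrative assumptions, not the authors' published architecture.

```python
# Hypothetical late-fusion classifier: per-modality projections are
# concatenated and passed to a small classification head.
import torch
import torch.nn as nn

class MultimodalFusionClassifier(nn.Module):
    def __init__(self, vis_dim=768, aud_dim=512, txt_dim=768,
                 hidden=256, num_classes=4):
        super().__init__()
        # Project each modality's pre-extracted embedding into a shared space.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(3 * hidden, num_classes),
        )

    def forward(self, vis, aud, txt):
        fused = torch.cat(
            [self.vis_proj(vis), self.aud_proj(aud), self.txt_proj(txt)],
            dim=-1,
        )
        return self.head(fused)

# Dummy batch of 8 videos.
model = MultimodalFusionClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 512), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```

For the deployment side, here is a minimal Spark Structured Streaming sketch that consumes video records from a Kafka topic, scores them with a stub classifier, and publishes decisions back to Kafka. The topic names, payload format, and scoring stub are hypothetical; the paper does not publish its pipeline code.

```python
# Hypothetical Spark Structured Streaming job. Requires the Kafka
# connector on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ...
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("mtikguard-moderation").getOrCreate()

# Read raw video records from an input topic (names are illustrative).
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tiktok-videos")
          .load())

def classify(payload: str) -> str:
    # Placeholder: decode the payload and run the multimodal model here.
    return "safe"

classify_udf = udf(classify, StringType())

# Kafka sinks expect string/binary "key" and "value" columns.
decisions = stream.select(
    col("key").cast("string").alias("key"),
    classify_udf(col("value").cast("string")).alias("value"),
)

query = (decisions.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "moderation-decisions")
         .option("checkpointLocation", "/tmp/mtikguard-ckpt")
         .start())
query.awaitTermination()
```

Checkpointed Structured Streaming gives at-least-once delivery, which suits moderation: re-scoring a video occasionally is cheaper than silently dropping one.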
Related papers
- TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization [19.94299183056601]
TripleSumm is a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. It achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu.
arXiv Detail & Related papers (2026-03-01T16:18:59Z)
- Multimodal Hate Detection Using Dual-Stream Graph Neural Networks [20.082029756403976]
Hateful videos present serious risks to online safety and real-world well-being. Although multimodal classification approaches integrate information from several modalities, they typically neglect that even minimal hateful content defines a video's category. We propose a novel multimodal dual-stream graph neural network model that captures structured information in videos.
arXiv Detail & Related papers (2025-09-16T20:20:05Z)
- Generative Video Matting [57.186684844156595]
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations. We introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models.
arXiv Detail & Related papers (2025-08-11T12:18:55Z)
- What You Have is What You Track: Adaptive and Robust Multimodal Tracking [72.92244578461869]
We present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete-modality and missing-modality settings.
arXiv Detail & Related papers (2025-07-08T11:40:21Z)
- HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction [16.78634288864967]
Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for social media popularity prediction.
arXiv Detail & Related papers (2025-07-01T16:31:50Z)
- CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval [70.9990850395981]
We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes four modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR is trained to enhance dynamic modality selection via two key innovations.
arXiv Detail & Related papers (2025-06-06T15:02:30Z)
- Simple Visual Artifact Detection in Sora-Generated Videos [9.991747596111011]
This study investigates visual artifacts frequently found and reported in Sora-generated videos. We propose a multi-label classification framework targeting four common artifact types. The best-performing model, based on ResNet-50, achieved an average multi-label classification accuracy of 94.14%.
arXiv Detail & Related papers (2025-04-30T05:41:43Z)
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [111.97026994761254]
Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture. MoT decouples non-embedding parameters of the model by modality. We evaluate MoT across multiple settings and model scales.
arXiv Detail & Related papers (2024-11-07T18:59:06Z)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions. We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos. Our dataset potentially paves the way for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV).
We propose a unified framework, UBiSS, for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- GAME-ON: Graph Attention Network based Multimodal Fusion for Fake News Detection [6.037721620350107]
We propose GAME-ON, a Graph Neural Network-based, end-to-end trainable framework to learn more robust data representations for multimodal fake news detection.
Our model outperforms the best comparable state-of-the-art baseline on Twitter by an average of 11% and stays competitive on Weibo, within a 2.6% margin, while using 65% fewer parameters.
arXiv Detail & Related papers (2022-02-25T03:27:37Z)
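The GAME-ON entry above describes graph-attention-based multimodal fusion. A minimal single-layer sketch of that idea follows: nodes are unimodal feature vectors (e.g., image regions and text tokens projected to a shared dimension) on a fully connected graph; the class name, dimensions, and pooling are illustrative assumptions, not the paper's model.

```python
# Hypothetical single-layer graph-attention fusion over a fully connected
# graph of unimodal feature nodes (GAME-ON-style fusion, simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Attention score for a node pair from concatenated features.
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, nodes):              # nodes: (N, dim)
        h = self.proj(nodes)               # (N, dim)
        n = h.size(0)
        # Enumerate all (i, j) pairs of the fully connected graph.
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)   # attention over neighbors j
        out = alpha @ h                    # (N, dim) attended node features
        return out.mean(dim=0)             # pooled multimodal representation

# 5 image-region nodes + 7 text-token nodes, already in a shared 256-d space.
fusion = GraphAttentionFusion()
rep = fusion(torch.randn(12, 256))
print(rep.shape)  # torch.Size([256])
```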
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.