Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching
- URL: http://arxiv.org/abs/2512.03553v1
- Date: Wed, 03 Dec 2025 08:20:58 GMT
- Title: Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching
- Authors: Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan,
- Abstract summary: We present a hybrid moderation framework that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases.<n>In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision.<n>These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
- Score: 2.9079112030626146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams}. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
Related papers
- IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs [11.254129271889035]
Visual token pruning has emerged as a critical technique for accelerating MLLM inference.<n>IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks.
arXiv Detail & Related papers (2026-02-10T11:20:24Z) - A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video [0.2864713389096699]
This work presents a comprehensive framework for automatically detecting and extracting personal names from news videos.<n>It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics.<n>The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and inference.
arXiv Detail & Related papers (2026-02-09T19:58:50Z) - SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams [53.78257200138774]
We propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules.<n>We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily.
arXiv Detail & Related papers (2026-01-14T14:31:16Z) - Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification [55.56234913868664]
We propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) for reliable learning on multimodal data.<n>The proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
arXiv Detail & Related papers (2026-01-12T03:14:12Z) - D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism.<n>It reduces FLOPs by 74.2% while retaining 99.2% of its original performance.<n>It marks a significant advancement with up to 63. 53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose textbfTACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks.<n>Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations [12.095552079739933]
We propose a unified framework for both systematic bias evaluation and targeted mitigation.<n>Our approach introduces six novel Metamorphic Relations (MRs)<n>MRs transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants.
arXiv Detail & Related papers (2025-11-29T16:56:38Z) - Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models [9.47743870776814]
Vision-Language Models (VLMs) learn joint representations by mapping images and text into a shared latent space.<n>GroVE builds on the Gaussian Process Latent Variable Model (GPLVM) to learn a shared low-dimensional latent space where image and text inputs are mapped to a unified representation.<n>GroVE achieves state-of-the-art uncertainty calibration across multiple downstream tasks, including cross-modal retrieval, visual question answering, and active learning.
arXiv Detail & Related papers (2025-05-08T11:57:35Z) - Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect.<n>We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment.<n>Our RML is self-supervised and can also be applied for downstream tasks as a regularization.
arXiv Detail & Related papers (2025-03-06T07:01:08Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Threshold-Consistent Margin Loss for Open-World Deep Metric Learning [42.03620337000911]
Existing losses used in deep metric learning (DML) for image retrieval often lead to highly non-uniform intra-class and inter-class representation structures.
Inconsistency often complicates the threshold selection process when deploying commercial image retrieval systems.
We propose a novel variance-based metric called Operating-Point-Inconsistency-Score (OPIS) that quantifies the variance in the operating characteristics across classes.
arXiv Detail & Related papers (2023-07-08T21:16:41Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.