Related papers: Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling

Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling

URL: http://arxiv.org/abs/2508.03296v1
Date: Tue, 05 Aug 2025 10:16:04 GMT
Title: Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling
Authors: Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu,
Abstract summary: Hi-Guard is a multimodal moderation framework that introduces a new policy-aligned decision paradigm.<n>To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt.<n>Experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability.
Score: 22.914127076888086
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term "Hierarchical" reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.

Related papers

HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs [33.51075655987504]
HiD-VAE is a novel framework that learns hierarchically disentangled item representations through two core innovations.<n>First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags.<n>Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap.
arXiv Detail & Related papers (2025-08-06T16:45:05Z)
Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions [25.974006393027228]
This paper aims to enable accurate fine-grained recognition without direct access to instance labels.<n>Unlike existing LLP-based methods, our framework explicitly exploits the hierarchical nature of fine-grained datasets.
arXiv Detail & Related papers (2025-05-29T03:18:25Z)
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively.<n>The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors.<n>We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z)
Enforcing Consistency and Fairness in Multi-level Hierarchical Classification with a Mask-based Output Layer [25.819440955594736]
We introduce a fair, model-agnostic layer designed to enforce taxonomy and optimize objectives, including consistency, fairness, and exact match.<n>Our evaluations demonstrate that the proposed layer not only improves the fairness of predictions but also enforces the taxonomy, resulting in consistent predictions and superior performance.
arXiv Detail & Related papers (2025-03-19T06:30:04Z)
Bidirectional Logits Tree: Pursuing Granularity Reconcilement in Fine-Grained Classification [89.20477310885731]
This paper addresses the challenge of Granularity Competition in fine-grained classification tasks.<n>Existing approaches typically develop independent hierarchy-aware models based on shared features extracted from a common base encoder.<n>We propose a novel framework called the Bidirectional Logits Tree (BiLT) for Granularity Reconcilement.
arXiv Detail & Related papers (2024-12-17T10:42:19Z)
Hierarchical Selective Classification [17.136832159667204]
This paper introduces hierarchical selective classification, extending selective classification to a hierarchical setting.<n>We first formalize hierarchical risk and coverage, and introduce hierarchical risk-coverage curves.<n>Next, we develop algorithms for hierarchical selective classification, and propose an efficient algorithm that guarantees a target accuracy constraint with high probability.
arXiv Detail & Related papers (2024-05-19T12:24:30Z)
Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing [87.48628403354351]
certification for machine learning is proving that no adversarial sample can evade a model within a range under certain conditions. Common certification methods for segmentation use a flat set of fine-grained classes, leading to high abstain rates due to model uncertainty. We propose a novel, more practical setting, which certifies pixels within a multi-level hierarchy, and adaptively relaxes the certification to a coarser level for unstable components.
arXiv Detail & Related papers (2024-02-13T11:59:43Z)
SemiReward: A General Reward Model for Semi-supervised Learning [58.47299780978101]
Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling. Main challenge is how to distinguish high-quality pseudo labels against the confirmation bias. We propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and filter out high-quality pseudo labels.
arXiv Detail & Related papers (2023-10-04T17:56:41Z)
Option-Aware Adversarial Inverse Reinforcement Learning for Robotic Control [44.77500987121531]
Hierarchical Imitation Learning (HIL) has been proposed to recover highly-complex behaviors in long-horizon tasks from expert demonstrations. We develop a novel HIL algorithm based on Adversarial Inverse Reinforcement Learning. We also propose a Variational Autoencoder framework for learning with our objectives in an end-to-end fashion.
arXiv Detail & Related papers (2022-10-05T00:28:26Z)
Weakly-supervised Action Localization via Hierarchical Mining [76.00021423700497]
Weakly-supervised action localization aims to localize and classify action instances in the given videos temporally with only video-level categorical labels. We propose a hierarchical mining strategy under video-level and snippet-level manners, i.e., hierarchical supervision and hierarchical consistency mining. We show that HiM-Net outperforms existing methods on THUMOS14 and ActivityNet1.3 datasets with large margins by hierarchically mining the supervision and consistency.
arXiv Detail & Related papers (2022-06-22T12:19:09Z)
Interpretable Reinforcement Learning with Multilevel Subgoal Discovery [77.34726150561087]
We propose a novel Reinforcement Learning model for discrete environments. In the model, an agent learns information about environment in the form of probabilistic rules. No reward function is required for learning; an agent only needs to be given a primary goal to achieve.
arXiv Detail & Related papers (2022-02-15T14:04:44Z)
Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost. In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting. Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
arXiv Detail & Related papers (2021-06-01T16:00:08Z)
Hierarchical Modeling for Out-of-Scope Domain and Intent Classification [55.23920796595698]
This paper focuses on out-of-scope intent classification in dialog systems. We propose a hierarchical multi-task learning approach based on a joint model to classify domain and intent simultaneously. Experiments show that the model outperforms existing methods in terms of accuracy, out-of-scope recall and F1.
arXiv Detail & Related papers (2021-04-30T06:38:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.