Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling
- URL: http://arxiv.org/abs/2508.03296v1
- Date: Tue, 05 Aug 2025 10:16:04 GMT
- Title: Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling
- Authors: Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu,
- Abstract summary: Hi-Guard is a multimodal moderation framework that introduces a new policy-aligned decision paradigm.<n>To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt.<n>Experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability.
- Score: 22.914127076888086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term "Hierarchical" reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.
Related papers
- HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs [33.51075655987504]
HiD-VAE is a novel framework that learns hierarchically disentangled item representations through two core innovations.<n>First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags.<n>Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap.
arXiv Detail & Related papers (2025-08-06T16:45:05Z) - Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions [25.974006393027228]
This paper aims to enable accurate fine-grained recognition without direct access to instance labels.<n>Unlike existing LLP-based methods, our framework explicitly exploits the hierarchical nature of fine-grained datasets.
arXiv Detail & Related papers (2025-05-29T03:18:25Z) - Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively.<n>The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors.<n>We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z) - Enforcing Consistency and Fairness in Multi-level Hierarchical Classification with a Mask-based Output Layer [25.819440955594736]
We introduce a fair, model-agnostic layer designed to enforce taxonomy and optimize objectives, including consistency, fairness, and exact match.<n>Our evaluations demonstrate that the proposed layer not only improves the fairness of predictions but also enforces the taxonomy, resulting in consistent predictions and superior performance.
arXiv Detail & Related papers (2025-03-19T06:30:04Z) - Bidirectional Logits Tree: Pursuing Granularity Reconcilement in Fine-Grained Classification [89.20477310885731]
This paper addresses the challenge of Granularity Competition in fine-grained classification tasks.<n>Existing approaches typically develop independent hierarchy-aware models based on shared features extracted from a common base encoder.<n>We propose a novel framework called the Bidirectional Logits Tree (BiLT) for Granularity Reconcilement.
arXiv Detail & Related papers (2024-12-17T10:42:19Z) - Hierarchical Selective Classification [17.136832159667204]
This paper introduces hierarchical selective classification, extending selective classification to a hierarchical setting.<n>We first formalize hierarchical risk and coverage, and introduce hierarchical risk-coverage curves.<n>Next, we develop algorithms for hierarchical selective classification, and propose an efficient algorithm that guarantees a target accuracy constraint with high probability.
arXiv Detail & Related papers (2024-05-19T12:24:30Z) - Adaptive Hierarchical Certification for Segmentation using Randomized Smoothing [87.48628403354351]
certification for machine learning is proving that no adversarial sample can evade a model within a range under certain conditions.
Common certification methods for segmentation use a flat set of fine-grained classes, leading to high abstain rates due to model uncertainty.
We propose a novel, more practical setting, which certifies pixels within a multi-level hierarchy, and adaptively relaxes the certification to a coarser level for unstable components.
arXiv Detail & Related papers (2024-02-13T11:59:43Z) - SemiReward: A General Reward Model for Semi-supervised Learning [58.47299780978101]
Semi-supervised learning (SSL) has witnessed great progress with various improvements in the self-training framework with pseudo labeling.
Main challenge is how to distinguish high-quality pseudo labels against the confirmation bias.
We propose a Semi-supervised Reward framework (SemiReward) that predicts reward scores to evaluate and filter out high-quality pseudo labels.
arXiv Detail & Related papers (2023-10-04T17:56:41Z) - Option-Aware Adversarial Inverse Reinforcement Learning for Robotic
Control [44.77500987121531]
Hierarchical Imitation Learning (HIL) has been proposed to recover highly-complex behaviors in long-horizon tasks from expert demonstrations.
We develop a novel HIL algorithm based on Adversarial Inverse Reinforcement Learning.
We also propose a Variational Autoencoder framework for learning with our objectives in an end-to-end fashion.
arXiv Detail & Related papers (2022-10-05T00:28:26Z) - Weakly-supervised Action Localization via Hierarchical Mining [76.00021423700497]
Weakly-supervised action localization aims to localize and classify action instances in the given videos temporally with only video-level categorical labels.
We propose a hierarchical mining strategy under video-level and snippet-level manners, i.e., hierarchical supervision and hierarchical consistency mining.
We show that HiM-Net outperforms existing methods on THUMOS14 and ActivityNet1.3 datasets with large margins by hierarchically mining the supervision and consistency.
arXiv Detail & Related papers (2022-06-22T12:19:09Z) - Interpretable Reinforcement Learning with Multilevel Subgoal Discovery [77.34726150561087]
We propose a novel Reinforcement Learning model for discrete environments.
In the model, an agent learns information about environment in the form of probabilistic rules.
No reward function is required for learning; an agent only needs to be given a primary goal to achieve.
arXiv Detail & Related papers (2022-02-15T14:04:44Z) - Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
arXiv Detail & Related papers (2021-06-01T16:00:08Z) - Hierarchical Modeling for Out-of-Scope Domain and Intent Classification [55.23920796595698]
This paper focuses on out-of-scope intent classification in dialog systems.
We propose a hierarchical multi-task learning approach based on a joint model to classify domain and intent simultaneously.
Experiments show that the model outperforms existing methods in terms of accuracy, out-of-scope recall and F1.
arXiv Detail & Related papers (2021-04-30T06:38:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.