TOT: Topology-Aware Optimal Transport For Multimodal Hate Detection
- URL: http://arxiv.org/abs/2303.09314v2
- Date: Mon, 24 Apr 2023 09:23:25 GMT
- Title: TOT: Topology-Aware Optimal Transport For Multimodal Hate Detection
- Authors: Linhao Zhang, Li Jin, Xian Sun, Guangluan Xu, Zequn Zhang, Xiaoyu Li,
Nayu Liu, Qing Liu, Shiyao Yan
- Abstract summary: We propose TOT: a topology-aware optimal transport framework to decipher the implicit harm in the memes scenario.
Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities.
The newly achieved state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of TOT.
- Score: 18.015012133043093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal hate detection, which aims to identify harmful content online such
as memes, is crucial for building a wholesome internet environment. Previous
work has made enlightening exploration in detecting explicit hate remarks.
However, most of their approaches neglect the analysis of implicit harm, which
is particularly challenging as explicit text markers and demographic visual
cues are often twisted or missing. The commonly leveraged cross-modal attention
mechanisms also suffer from the distributional modality gap and lack logical
interpretability. To address these semantic gap issues, we propose TOT, a
topology-aware optimal transport framework to decipher the implicit harm in the
memes scenario, which formulates cross-modal alignment as the solution of
optimal transport plans. Specifically, we leverage an optimal transport kernel
method to capture complementary information from multiple modalities. The
kernel embedding provides a non-linear transformation into a reproducing kernel
Hilbert space (RKHS), which helps eliminate the distributional modality gap.
Moreover, we perceive the topology
information based on aligned representations to conduct bipartite graph path
reasoning. The newly achieved state-of-the-art performance on two publicly
available benchmark datasets, together with further visual analysis,
demonstrates the superiority of TOT in capturing implicit cross-modal alignment.
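To make the cross-modal alignment idea above concrete, the following is a minimal sketch of entropy-regularized optimal transport between text-token and image-region features, with an RBF kernel standing in for the RKHS embedding and a simple thresholding of the transport plan standing in for the bipartite alignment graph. The toy random features, the `sinkhorn` helper, and the threshold rule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of OT-based cross-modal alignment, assuming toy random
# features in place of real text/image encoders. Not the TOT implementation.
import numpy as np

def rbf_kernel(x, y, gamma=0.1):
    """RBF (Gaussian) kernel matrix between rows of x and rows of y."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def sinkhorn(cost, a, b, eps=0.05, n_iters=200):
    """Entropy-regularized OT: transport plan with row sums a and column sums b."""
    K = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):         # alternating Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy "modalities": 6 text tokens and 4 image regions in a shared 16-d space.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(6, 16))
image_feats = rng.normal(size=(4, 16))

# Kernel similarity (implicit RKHS); transport cost = 1 - similarity.
cost = 1.0 - rbf_kernel(text_feats, image_feats)

# Uniform marginals over tokens and regions.
a = np.full(6, 1 / 6)
b = np.full(4, 1 / 4)
plan = sinkhorn(cost, a, b)

# Read off a bipartite alignment graph: keep token-region pairs whose
# transport mass exceeds the uniform baseline 1 / (6 * 4).
edges = [(i, j) for i in range(6) for j in range(4) if plan[i, j] > 1 / (6 * 4)]
print("transport plan:\n", plan.round(3))
print("bipartite alignment edges:", edges)
```

Entropic regularization keeps the plan dense and differentiable, which is what makes this kind of alignment usable as a trainable module; the kernelized cost is where an RKHS-style non-linear transformation would enter in practice.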
Related papers
- GASE: Graph Attention Sampling with Edges Fusion for Solving Vehicle Routing Problems [6.084414764415137]
  We propose an adaptive Graph Attention Sampling with the Edges Fusion framework to solve vehicle routing problems.
  Our proposed model outperforms the existing methods by 2.08%-6.23% and shows stronger generalization ability.
  arXiv Detail & Related papers (2024-05-21T03:33:07Z)
- Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement [58.9768112704998]
  Disentangled representation learning strives to extract the intrinsic factors within observed data.
  We introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias.
  This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs.
  arXiv Detail & Related papers (2024-02-15T05:07:54Z)
- Unified Domain Adaptive Semantic Segmentation [96.74199626935294]
  Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain.
  We propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies.
  Our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks.
  arXiv Detail & Related papers (2023-11-22T09:18:49Z)
- Disentangled Representation Learning with Transmitted Information Bottleneck [57.22757813140418]
  We present DisTIB (Transmitted Information Bottleneck for Disentangled representation learning), a novel objective that navigates the balance between information compression and preservation.
  arXiv Detail & Related papers (2023-11-03T03:18:40Z)
- Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
  This paper tackles the challenges for the vision modality from a multimodal point of view.
  We propose Cross-modal Guidance (CMG) to tackle the redundant information issue and sparse space issue.
  To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
  arXiv Detail & Related papers (2023-10-04T13:44:56Z)
- Cross-Modal Translation and Alignment for Survival Analysis [7.657906359372181]
  We present a framework to explore the intrinsic cross-modal correlations and transfer potential complementary information.
  Our experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
  arXiv Detail & Related papers (2023-09-22T13:29:14Z)
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
  Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
  Existing literature addresses this challenge by employing local-based representation approaches.
  This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
  arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- Multimodal Trajectory Prediction via Topological Invariance for Navigation at Uncontrolled Intersections [45.508973373913946]
  We focus on decentralized navigation among multiple non-communicating rational agents at street intersections without traffic signs or signals.
  Our key insight is that the geometric structure of the intersection and the incentive of agents to move efficiently and avoid collisions (rationality) reduces the space of likely behaviors.
  We design Multiple Topologies Prediction (MTP), a data-driven trajectory-prediction mechanism that reconstructs trajectory representations of high-likelihood modes in multiagent intersection scenes.
  arXiv Detail & Related papers (2020-11-08T02:56:42Z)
- Representation Learning via Adversarially-Contrastive Optimal Transport [40.52344027750609]
  We set the problem within the context of contrastive representation learning.
  We propose a framework connecting Wasserstein GANs with a novel classifier.
  Our results demonstrate competitive performance against challenging baselines.
  arXiv Detail & Related papers (2020-07-11T19:46:18Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
  We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
  The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
  Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
  arXiv Detail & Related papers (2020-05-07T15:13:23Z)