Fugu-MT Paper Translation (Abstract): D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

Paper overview: D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

arxiv url: http://arxiv.org/abs/2509.06771v1
Date: Mon, 08 Sep 2025 14:55:16 GMT
Status: translation complete
System last updated: 2025-09-09 14:07:04.201948
Title: D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
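Title (reference translation): D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
Authors: Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
Abstract summary: Dark humor in online memes poses unique challenges because it relies on implicit, sensitive, and culturally contextual cues. The paper introduces a new dataset of 4,379 memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating. It proposes a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model.
Reference score (independently computed attention score): 4.561044673225099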
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams (text, image, and reasoning) via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
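Abstract (reference translation): Dark humor in online memes poses unique challenges because it relies on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). On top of this dataset, we propose a reasoning-augmented framework that first uses a Large Vision-Language Model (VLM) to generate a structured explanation for each meme. Through a Role-Reversal Self-Loop, the VLM takes the author's perspective and iteratively refines its explanation to ensure completeness and alignment. Next, a text encoder extracts textual features from both the OCR transcript and the self-refined reasoning, and a vision transformer extracts visual features. A Tri-stream Cross-Reasoning Network (TCRNet) fuses the three streams (text, image, and reasoning) through pairwise attention mechanisms and produces a unified representation for classification. Experiments show that the proposed approach outperforms strong baselines on three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to encourage further research on multimodal humor understanding and content moderation. https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning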
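
The abstract describes fusing three feature streams (text, image, and reasoning) with pairwise attention. The following is a minimal sketch of that general idea, not the authors' TCRNet: it assumes plain scaled dot-product attention with no learned projections, mean pooling, and concatenation of the six ordered stream pairs. All function names (`cross_attend`, `mean_pool`, `fuse_three_streams`) and the random toy features are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query vector attends over keys_values.

    Assumption: keys and values are the same vectors (no learned projections).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                         for j in range(dim)])
    return attended


def mean_pool(sequence):
    """Average a sequence of vectors into a single vector."""
    n = len(sequence)
    return [sum(vec[j] for vec in sequence) / n for j in range(len(sequence[0]))]


def fuse_three_streams(text, image, reasoning):
    """Attend every stream over every other stream, pool, and concatenate."""
    streams = {"text": text, "image": image, "reasoning": reasoning}
    fused = []
    for query_name, query_seq in streams.items():
        for kv_name, kv_seq in streams.items():
            if query_name != kv_name:
                fused.extend(mean_pool(cross_attend(query_seq, kv_seq)))
    return fused


# Toy features standing in for encoder outputs (hypothetical sizes).
random.seed(0)
DIM = 4


def toy_sequence(length):
    return [[random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(length)]


text_feats = toy_sequence(5)       # e.g. OCR transcript tokens
image_feats = toy_sequence(3)      # e.g. vision transformer patches
reasoning_feats = toy_sequence(7)  # e.g. refined explanation tokens

fused = fuse_three_streams(text_feats, image_feats, reasoning_feats)
print(len(fused))  # 24: six ordered stream pairs times DIM
```

In a trained model the fused vector would feed a classification head for each task; the learned query, key, and value projections that a real attention layer uses are omitted here to keep the sketch dependency-free.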
