CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization
- URL: http://arxiv.org/abs/2602.02175v2
- Date: Tue, 03 Feb 2026 04:22:27 GMT
- Title: CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization
- Authors: Xinquan Yu, Wei Lu, Xiangyang Luo, Rui Yang,
- Abstract summary: Coupling Implicit and Explicit Cues (CIEC) aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs. For the image branch, the Textual-guidance Refine Patch Selection (TRPS) module integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions, aided by spatial priors. For the text branch, the Visual-deviation Calibrated Token Grounding (VCTG) module focuses on meaningful content words and leverages relative visual bias to assist token localization.
- Score: 25.78477436147408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Current methods, however, rely on costly and time-consuming fine-grained annotations, such as patch/token-level labels. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which achieves multimodal weakly-supervised manipulation localization for image-text pairs using only coarse-grained image/sentence-level annotations. It comprises two branches: image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module, which integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions, aided by spatial priors. It is followed by background silencing and spatial contrast constraints that suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module, which focuses on meaningful content words and leverages relative visual bias to assist token localization. It is followed by asymmetric sparse and semantic consistency constraints that mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.
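The abstract does not spell out how coarse labels alone can supervise patch/token localization, so the snippet below sketches the general multiple-instance-learning principle that such weak supervision typically rests on: per-patch manipulation scores are aggregated into a single image-level prediction trained against the coarse image-level label, while the per-patch scores themselves double as the localization map at inference. This is a minimal, hypothetical sketch; the `WeakPatchLocalizer` class, top-k aggregation, and ViT-style 768-dim features are illustrative assumptions, not the authors' TRPS/VCTG implementation.

```python
# Generic sketch of image-level weak supervision for patch localization
# (multiple-instance-learning style). NOT the paper's TRPS/VCTG modules;
# class name, top-k pooling, and feature dimension are assumptions.
import torch
import torch.nn as nn

class WeakPatchLocalizer(nn.Module):
    def __init__(self, feat_dim: int = 768, topk: int = 8):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # per-patch manipulation score
        self.topk = topk

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (batch, num_patches, feat_dim), e.g. ViT/CLIP patch tokens
        patch_scores = self.scorer(patch_feats).squeeze(-1)   # (B, P)
        # Pool the most suspicious patches into one image-level logit, so only
        # a coarse real/fake image label is needed for training.
        topk_scores, _ = patch_scores.topk(self.topk, dim=1)
        image_logit = topk_scores.mean(dim=1)                 # (B,)
        return image_logit, patch_scores

model = WeakPatchLocalizer()
feats = torch.randn(4, 196, 768)          # dummy patch features (14x14 grid)
labels = torch.tensor([1., 0., 1., 0.])   # image-level manipulated / pristine
logit, patch_map = model(feats)           # patch_map acts as the localization map
loss = nn.functional.binary_cross_entropy_with_logits(logit, labels)
loss.backward()
```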
Related papers
- Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
arXiv Detail & Related papers (2026-02-27T12:04:07Z) - ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization [62.03035862528452]
ForgeryVCR is a framework that materializes imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks.
arXiv Detail & Related papers (2026-02-15T11:14:47Z) - SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization [45.19935082419337]
Malicious image manipulation threatens public safety and requires efficient localization methods. Existing weakly supervised methods rely on image-level binary labels and focus on global classification. We propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues.
arXiv Detail & Related papers (2026-01-09T07:25:55Z) - Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition [41.77490816513839]
We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
arXiv Detail & Related papers (2025-11-12T14:54:53Z) - SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment [8.657941729790599]
We introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Experiments on the Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance.
arXiv Detail & Related papers (2025-11-03T09:41:32Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
In particular, our approach allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-step working strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization [26.63463867095924]
We propose a novel weakly-supervised temporal action localization (WTAL) approach named Convex Combination Consistency between Neighbors (C³BN).
C³BN consists of two key ingredients: a micro data augmentation strategy that increases the diversity in-between adjacent snippets, and a macro-micro consistency regularization.
Experimental results demonstrate the effectiveness of C³BN on top of various baselines for WTAL with video-level and point-level supervision.
arXiv Detail & Related papers (2022-05-01T05:30:53Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions (a sketch of this mask-and-predict comparison appears after this list).
Experiments on Semantic Textual Similarity show the resulting distance, NDD, to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
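As a companion to the mask-and-predict strategy in the last entry above, the snippet below sketches how an MLM's predicted distributions at shared word positions can be compared between two highly overlapped sentences. The choice of bert-base-uncased, the hand-picked shared words, and Jensen-Shannon divergence are illustrative assumptions, not the paper's exact NDD recipe.

```python
# Mask each shared word in both sentences, read the MLM's predicted
# distribution at the masked position, and compare the two distributions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_distribution(sentence: str, word: str) -> torch.Tensor:
    """Distribution the MLM predicts at `word`'s (masked) position."""
    masked = sentence.replace(word, tok.mask_token, 1)
    inputs = tok(masked, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    return logits[0, pos].softmax(dim=-1)

def js_divergence(p: torch.Tensor, q: torch.Tensor) -> float:
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12) / b.clamp_min(1e-12)).log()).sum()
    return 0.5 * (kl(p, m) + kl(q, m)).item()

s1 = "the movie was surprisingly good"
s2 = "the movie was surprisingly bad"
shared = ["movie", "surprisingly"]  # stand-in for the longest common sequence
dist = sum(js_divergence(masked_distribution(s1, w),
                         masked_distribution(s2, w)) for w in shared) / len(shared)
print(f"distributional distance: {dist:.4f}")
```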