Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework
- URL: http://arxiv.org/abs/2508.01338v1
- Date: Sat, 02 Aug 2025 12:14:29 GMT
- Title: Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework
- Authors: Ziqi Sheng, Junyan Wu, Wei Lu, Jiantao Zhou
- Abstract summary: ViLaCo is a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models. ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
- Score: 16.961220047066792
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Image forgery localization aims to precisely identify tampered regions within images, but it commonly depends on costly pixel-level annotations. To alleviate this annotation burden, weakly supervised image forgery localization (WSIFL) has emerged, yet existing methods still achieve limited localization performance as they mainly exploit intra-image consistency clues and lack external semantic guidance to compensate for weak supervision. In this paper, we propose ViLaCo, a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models (VLMs), enabling accurate pixel-level localization using only image-level labels. Specifically, ViLaCo first incorporates semantic knowledge through a vision-language feature modeling network, which jointly extracts textual and visual priors using pre-trained VLMs. Next, an adaptive vision-language reasoning network aligns textual semantics and visual features through mutual interactions, producing semantically aligned representations. Subsequently, these representations are passed into dual prediction heads, where the coarse head performs image-level classification and the fine head generates pixel-level localization masks, thereby bridging the gap between weak supervision and fine-grained localization. Moreover, a contrastive patch consistency module is introduced to cluster tampered features while separating authentic ones, facilitating more reliable forgery discrimination. Extensive experiments on multiple public datasets demonstrate that ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
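The abstract is the only description of the method available here. As a rough illustration, below is a minimal PyTorch sketch of the dual prediction heads and the contrastive patch consistency idea, trained from image-level labels only. Everything in it is an assumption: the names (DualHeads, patch_consistency_loss, training_loss), the mean-pooled coarse head, and the use of the fine head's own sigmoid outputs as soft pseudo-labels are illustrative guesses, not the authors' implementation.
```python
# Hypothetical sketch of dual prediction heads plus a contrastive
# patch-consistency term, supervised by image-level labels only.
# All names and design details are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeads(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.coarse = nn.Linear(dim, 1)  # image-level real/fake logit
        self.fine = nn.Linear(dim, 1)    # per-patch tamper logits

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (B, N, D) semantically aligned patch features
        # from the vision-language reasoning network.
        img_feat = patch_feats.mean(dim=1)                 # global pool
        coarse_logit = self.coarse(img_feat).squeeze(-1)   # (B,)
        fine_logits = self.fine(patch_feats).squeeze(-1)   # (B, N)
        return coarse_logit, fine_logits

def patch_consistency_loss(patch_feats, fine_logits, tau: float = 0.07):
    """Pull likely-tampered patches together and push them away from
    likely-authentic ones, using the fine head's own predictions as
    soft pseudo-labels (no pixel-level ground truth needed)."""
    feats = F.normalize(patch_feats, dim=-1)               # (B, N, D)
    w = torch.sigmoid(fine_logits)                         # (B, N)
    # Soft prototypes for tampered / authentic patches.
    proto_t = F.normalize((w.unsqueeze(-1) * feats).sum(1), dim=-1)
    proto_a = F.normalize(((1 - w).unsqueeze(-1) * feats).sum(1), dim=-1)
    sim_t = (feats @ proto_t.unsqueeze(-1)).squeeze(-1) / tau  # (B, N)
    sim_a = (feats @ proto_a.unsqueeze(-1)).squeeze(-1) / tau
    logits = torch.stack([sim_a, sim_t], dim=-1)           # (B, N, 2)
    target = torch.stack([1 - w, w], dim=-1)               # soft labels
    return F.cross_entropy(logits.flatten(0, 1), target.flatten(0, 1))

def training_loss(heads, patch_feats, image_labels, lam: float = 0.5):
    # Weak supervision: BCE on the image-level label plus the
    # consistency term; no mask labels ever touch the fine head.
    coarse_logit, fine_logits = heads(patch_feats)
    cls = F.binary_cross_entropy_with_logits(coarse_logit, image_labels)
    return cls + lam * patch_consistency_loss(patch_feats, fine_logits)
```
Treating the fine head's sigmoid outputs as soft prototypes is one plausible way to "cluster tampered features while separating authentic ones" without pixel annotations; the paper's actual module may differ substantially.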
Related papers
- Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach [43.419607730361996]
Vision-Language Models (VLMs) like CLIP achieve cross-modal alignment through contrastive learning. Traditional prompt engineering relies on coarse-grained category labels, neglecting fine-grained local semantics. We propose a plug-and-play solution that enables CLIP to process localized visual descriptors.
arXiv Detail & Related papers (2025-07-04T10:24:26Z) - Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. Our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed.
arXiv Detail & Related papers (2024-10-14T21:01:01Z) - Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations. We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with a region-text contrastive loss and modules (a generic sketch of this style of region-text objective appears after this list). CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z) - VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that utilizes only image-caption data yet achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent images and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Spatially Consistent Representation Learning [12.120041613482558]
We propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks.
We devise a novel self-supervised objective that tries to produce coherent spatial representations of a randomly cropped local region.
On various downstream localization tasks with benchmark datasets, the proposed SCRL shows significant performance improvements.
arXiv Detail & Related papers (2021-03-10T15:23:45Z) - MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
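Several of the entries above (CLOC, DIAL, CoCu) revolve around region-text contrastive alignment. As a generic, hedged sketch rather than any one paper's implementation, a symmetric InfoNCE over paired region and text embeddings looks like the following; the function name and the row-wise pairing convention are assumptions.
```python
# Generic region-text contrastive objective in the CLIP style that
# these works extend; not the authors' actual code.
import torch
import torch.nn.functional as F

def region_text_contrastive(region_emb, text_emb, tau: float = 0.07):
    """region_emb, text_emb: (M, D) paired region crops and their
    caption/phrase embeddings; row i of each side is a positive pair."""
    r = F.normalize(region_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = r @ t.T / tau                     # (M, M) similarities
    labels = torch.arange(len(r), device=r.device)
    # Symmetric loss: each region retrieves its text and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```
The symmetric form trains both retrieval directions, which is the standard CLIP-style formulation; region-level variants differ mainly in how the region embeddings are pooled and paired with text.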