LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis
- URL: http://arxiv.org/abs/2512.21482v1
- Date: Thu, 25 Dec 2025 03:02:27 GMT
- Title: LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis
- Authors: Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Huazhe Tan, Weibin Yao, Jianshu Li,
- Abstract summary: Text-centric forgeries pose a significant threat to societal security and information authenticity. Current methods for text-centric forgery analysis are often limited to coarse-grained visual analysis. We introduce LogicLens, a unified framework for Visual-Textual Co-reasoning.
- Score: 10.305807834419765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sophisticated text-centric forgeries, fueled by rapid AIGC advancements, pose a significant threat to societal security and information authenticity. Current methods for text-centric forgery analysis are often limited to coarse-grained visual analysis and lack the capacity for sophisticated reasoning. Moreover, they typically treat detection, grounding, and explanation as discrete sub-tasks, overlooking their intrinsic relationships for holistic performance enhancement. To address these challenges, we introduce LogicLens, a unified framework for Visual-Textual Co-reasoning that reformulates these objectives into a joint task. The deep reasoning of LogicLens is powered by our novel Cross-Cues-aware Chain of Thought (CCT) mechanism, which iteratively cross-validates visual cues against textual logic. To ensure robust alignment across all tasks, we further propose a weighted multi-task reward function for GRPO-based optimization. Complementing this framework, we first designed the PR$^2$ (Perceiver, Reasoner, Reviewer) pipeline, a hierarchical and iterative multi-agent system that generates high-quality, cognitively-aligned annotations. Then, we constructed RealText, a diverse dataset comprising 5,397 images with fine-grained annotations, including textual explanations, pixel-level segmentation, and authenticity labels for model training. Extensive experiments demonstrate the superiority of LogicLens across multiple benchmarks. In a zero-shot evaluation on T-IC13, it surpasses the specialized framework by 41.4% and GPT-4o by 23.4% in macro-average F1 score. Moreover, on the challenging dense-text T-SROIE dataset, it establishes a significant lead over other MLLM-based methods in mF1, CSS, and the macro-average F1. Our dataset, model, and code will be made publicly available.
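The abstract names a weighted multi-task reward function for GRPO-based optimization but does not spell out its form. The sketch below is a minimal illustration of one plausible instantiation, in which per-task rewards for detection, grounding, and explanation are combined with fixed weights; all function names, weights, and proxy metrics here are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical sketch of a weighted multi-task reward for GRPO-style training.
# The actual per-task rewards and weights used by LogicLens are not given in
# the abstract; everything below is an illustrative assumption.

def detection_reward(pred_label: int, gold_label: int) -> float:
    """1.0 if the real/forged prediction matches the authenticity label."""
    return 1.0 if pred_label == gold_label else 0.0

def grounding_reward(pred_mask: set, gold_mask: set) -> float:
    """IoU between predicted and annotated forged-pixel index sets."""
    if not pred_mask and not gold_mask:
        return 1.0
    union = pred_mask | gold_mask
    return len(pred_mask & gold_mask) / len(union) if union else 0.0

def explanation_reward(pred_text: str, gold_text: str) -> float:
    """Crude token-overlap proxy for explanation quality."""
    pred, gold = set(pred_text.lower().split()), set(gold_text.lower().split())
    return len(pred & gold) / max(len(gold), 1)

def multitask_reward(sample: dict, weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted sum over the three jointly formulated tasks."""
    w_det, w_grd, w_exp = weights
    return (w_det * detection_reward(sample["pred_label"], sample["gold_label"])
            + w_grd * grounding_reward(sample["pred_mask"], sample["gold_mask"])
            + w_exp * explanation_reward(sample["pred_text"], sample["gold_text"]))

if __name__ == "__main__":
    sample = {"pred_label": 1, "gold_label": 1,
              "pred_mask": {1, 2, 3}, "gold_mask": {2, 3, 4},
              "pred_text": "the date field was altered",
              "gold_text": "the date field was digitally altered"}
    print(f"reward = {multitask_reward(sample):.3f}")
```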
Related papers
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline.
We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
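Both LogicLens above and Pretext-GRPO build on GRPO, whose core step is to score each sampled completion against the mean and spread of its own sample group instead of a learned value function. A minimal sketch of that group-relative advantage computation (generic GRPO, not the Pretext-GRPO variant itself):

```python
# Group-relative advantages at the heart of GRPO: z-score each completion's
# reward within its own sampled group, so no value network is needed.
# The group size and reward values below are toy examples.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for one prompt, scored by some reward function:
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```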
- Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings [16.728984584960738]
This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings.
We investigate two primary in-process methods: sequential concatenation and parallel caching.
Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors.
arXiv Detail & Related papers (2025-10-09T19:45:54Z)
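Of the two in-process methods named above, sequential concatenation is the simpler to illustrate: a node's neighbors are linearized into the input text before encoding. A hedged sketch with a generic embedding callable; the `embed` function, separator token, and neighbor cap are placeholders, not Struc-EMB's actual interface.

```python
# Sketch of structure-aware embedding via sequential concatenation.
# `embed` stands in for any sentence-embedding model; the [SEP] separator
# and neighbor ordering are illustrative assumptions.
from typing import Callable

def embed_with_neighbors(node_text: str,
                         neighbor_texts: list[str],
                         embed: Callable[[str], list[float]],
                         max_neighbors: int = 4) -> list[float]:
    """Linearize a node and its graph neighbors into one sequence, then encode."""
    context = " [SEP] ".join(neighbor_texts[:max_neighbors])
    return embed(f"{node_text} [SEP] {context}" if context else node_text)

# Toy usage with a dummy bag-of-vowels "embedding".
dummy_embed = lambda s: [s.count(c) / len(s) for c in "aeiou"]
vec = embed_with_neighbors("graph databases store edges",
                           ["nodes carry labels", "edges carry weights"],
                           dummy_embed)
print(vec)
```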
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text.
CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
- Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models [58.456656119178064]
Vision-Language Models (VLMs) have emerged as foundational for multimodal intelligence.
However, their capacity for logical understanding remains significantly underexplored.
We introduce LogicBench, a benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios.
We propose LogicCLIP, a training framework designed to boost VLMs' logical sensitivity.
arXiv Detail & Related papers (2025-08-15T08:40:13Z)
- Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework.
Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image.
This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z)
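ThinkFirst's training-free, two-stage design reduces to a short piece of control flow: describe first, then segment with the description as context. In the sketch below, `describe_with_mllm` and `segment_with_assistant` are hypothetical stand-ins for the MLLM and the language-instructed segmentation assistant; neither exact API is specified by the paper's summary.

```python
# Two-stage sketch of a ThinkFirst-style pipeline. Both callables are
# hypothetical placeholders: stage one would be GPT-4o or another strong
# MLLM, stage two a language-instructed segmentation assistant.
from typing import Callable

def think_first_segment(image_path: str,
                        query: str,
                        describe_with_mllm: Callable[[str, str], str],
                        segment_with_assistant: Callable[[str, str], dict]) -> dict:
    # Stage 1: elicit a chain-of-thought description of the image.
    prompt = ("Describe this image step by step, reasoning about objects, "
              f"relations, and anything relevant to: {query}")
    cot_description = describe_with_mllm(image_path, prompt)
    # Stage 2: hand the description, plus the query, to the segmenter.
    return segment_with_assistant(image_path, f"{query}\nContext: {cot_description}")

# Toy usage with stubbed models.
mask = think_first_segment(
    "scene.jpg", "the person holding the umbrella",
    describe_with_mllm=lambda img, p: "One person stands left, holding an umbrella.",
    segment_with_assistant=lambda img, q: {"mask": "stub", "instruction": q})
print(mask["instruction"])
```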
- Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation [24.081573908824353]
First-order logic (FOL) reasoning is pivotal for intelligent systems.
Existing benchmarks often rely on extensive human annotation or handcrafted templates.
We propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models with the rigor and precision of symbolic provers.
arXiv Detail & Related papers (2025-02-10T15:31:54Z)
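ProverGen's division of labor, where an LLM proposes candidate problems and a symbolic prover certifies the gold label, can be mimicked with an off-the-shelf solver. A minimal sketch using Z3 (the `z3-solver` package) as the symbolic side; the propositional toy instance and the entailment rule are illustrative, not the paper's FOL pipeline.

```python
# Minimal LLM-plus-prover validation loop in the spirit of ProverGen,
# using Z3 (pip install z3-solver). Here the "LLM-generated" formulas are
# hard-coded; the prover checks whether the premises entail the conclusion.
from z3 import Bool, Implies, Not, Solver, unsat

def entails(premises, conclusion) -> bool:
    """Premises entail conclusion iff premises AND NOT(conclusion) is UNSAT."""
    s = Solver()
    s.add(*premises)
    s.add(Not(conclusion))
    return s.check() == unsat

# A toy instance an LLM might propose: rain -> wet, rain; does "wet" follow?
rain, wet = Bool("rain"), Bool("wet")
premises = [Implies(rain, wet), rain]
print(entails(premises, wet))        # True: label the item "provable"
print(entails(premises, Not(rain)))  # False: a distractor conclusion
```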
- Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception [10.614437503578856]
This paper proposes the Meta-Chunking framework, which specifically enhances chunking quality.
We design two adaptive chunking techniques based on uncertainty, namely Perplexity Chunking and Margin Sampling Chunking.
We establish a global information compensation mechanism, encompassing a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure.
arXiv Detail & Related papers (2024-10-16T17:59:32Z)
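Perplexity Chunking lends itself to a compact sketch: a boundary is placed where the model's uncertainty about the next sentence spikes, signaling a weak logical link. In the sketch below, `sentence_ppl` is a hypothetical stand-in for a language-model perplexity scorer, and the fixed threshold is a simplification of the paper's uncertainty-based rule.

```python
# Sketch of perplexity-based chunk boundary detection in the spirit of
# Meta-Chunking's Perplexity Chunking. `sentence_ppl` is a placeholder for
# an LM scoring each sentence given the running chunk; the threshold rule
# is an illustrative simplification.
from typing import Callable

def perplexity_chunk(sentences: list[str],
                     sentence_ppl: Callable[[str, str], float],
                     threshold: float = 30.0) -> list[list[str]]:
    chunks: list[list[str]] = [[]]
    for sent in sentences:
        context = " ".join(chunks[-1])
        # High perplexity given the current chunk => weak logical link => cut.
        if chunks[-1] and sentence_ppl(context, sent) > threshold:
            chunks.append([])
        chunks[-1].append(sent)
    return chunks

# Toy scorer: pretends sentences sharing no words with the context are surprising.
def toy_ppl(context: str, sent: str) -> float:
    shared = set(context.lower().split()) & set(sent.lower().split())
    return 10.0 if shared else 50.0

doc = ["The cat sat on the mat.", "The cat then slept.", "Quarterly revenue rose 8%."]
print(perplexity_chunk(doc, toy_ppl))  # [[cat sentences], [revenue sentence]]
```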
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
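The mutual training of classification, segmentation, and recognition branches described above typically reduces to optimizing a weighted joint loss over shared features, so gradients from every branch shape the common backbone. A generic sketch; the loss choices, weights, and tensor shapes are illustrative assumptions, not TextFormer's exact configuration.

```python
# Generic multi-task joint loss for an end-to-end text spotter whose
# classification, segmentation, and recognition heads share one backbone.
# Weights and loss functions are illustrative placeholders.
import torch
import torch.nn as nn

cls_loss_fn = nn.CrossEntropyLoss()    # text / non-text query classification
seg_loss_fn = nn.BCEWithLogitsLoss()   # per-pixel text mask
rec_loss_fn = nn.CrossEntropyLoss()    # per-step character prediction

def joint_loss(cls_logits, cls_tgt, seg_logits, seg_tgt, rec_logits, rec_tgt,
               w=(1.0, 1.0, 0.5)):
    """Weighted sum lets all three branches train the shared features jointly."""
    return (w[0] * cls_loss_fn(cls_logits, cls_tgt)
            + w[1] * seg_loss_fn(seg_logits, seg_tgt)
            + w[2] * rec_loss_fn(rec_logits.flatten(0, 1), rec_tgt.flatten()))

# Toy shapes: 8 queries, 2 classes; 8 masks of 32x32; 10 decode steps, 37 chars.
loss = joint_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)),
                  torch.randn(8, 32, 32), torch.rand(8, 32, 32),
                  torch.randn(8, 10, 37), torch.randint(0, 37, (8, 10)))
print(loss.item())
```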
- MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text Generation [102.20036684996248]
We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning.
We conduct experiments on two data-to-text generation tasks, WebNLG and LogicNLG.
arXiv Detail & Related papers (2022-12-16T17:36:23Z)