Related papers: Toward Real Text Manipulation Detection: New Dataset and New Solution

URL: http://arxiv.org/abs/2312.06934v2
Date: Tue, 23 Jan 2024 09:23:40 GMT
Title: Toward Real Text Manipulation Detection: New Dataset and New Solution
Authors: Dongliang Luo, Yuliang Liu, Rui Yang, Xianjin Liu, Jishen Zeng, Yu Zhou, Xiang Bai
Abstract summary: High costs associated with professional text manipulation limit the availability of real-world datasets. We present the Real Text Manipulation dataset, encompassing 14,250 text images. Our contributions aim to propel advancements in real-world text tampering detection.
Score: 58.557504531896704
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the surge in realistic text tampering, detecting fraudulent text in images has gained prominence for maintaining information security. However, the high costs associated with professional text manipulation and annotation limit the availability of real-world datasets, with most relying on synthetic tampering, which inadequately replicates real-world tampering attributes. To address this issue, we present the Real Text Manipulation (RTM) dataset, encompassing 14,250 text images, which include 5,986 manually and 5,258 automatically tampered images, created using a variety of techniques, alongside 3,006 unaltered text images for evaluating solution stability. Our evaluations indicate that existing methods falter in text forgery detection on the RTM dataset. We propose a robust baseline solution featuring a Consistency-aware Aggregation Hub and a Gated Cross Neighborhood-attention Fusion module for efficient multi-modal information fusion, supplemented by a Tampered-Authentic Contrastive Learning module during training, enriching feature representation distinction. This framework, extendable to other dual-stream architectures, demonstrated notable localization performance improvements of 7.33% and 6.38% on manual and overall manipulations, respectively. Our contributions aim to propel advancements in real-world text tampering detection. Code and dataset will be made available at https://github.com/DrLuo/RTM

Related papers

Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies [3.7498611358320733]
Synthetic images cannot faithfully reproduce real-world scenarios, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling, narrow this domain gap by exploiting unlabeled real text images. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM frame, which jointly learns low and high-level textual representations.
arXiv Detail & Related papers (2025-05-11T05:52:55Z)
Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition [7.560970003549404]
TextSSR is a novel framework for Synthesizing Scene Text Recognition data. It ensures accuracy by focusing on generating text within a specified image region. We construct an anagram-based TextSSR-F dataset with 0.4 million text instances with complexity and realism.
arXiv Detail & Related papers (2024-12-02T05:26:25Z)
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM). Our approach introduces canonical class-aware glyph masks to suppress background and text style noise. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
arXiv Detail & Related papers (2024-02-21T09:22:45Z)
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images. We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model. Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T06:46:37Z)
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning [19.856492291263102]
We propose representation learning for real-time scene text detection. For semantic representation learning, we propose global-dense semantic contrast (GDSC) and top-down modeling (TDM). With the proposed GDSC and TDM, the encoder network learns stronger representation without introducing any parameters and computations during inference. The proposed method achieves 87.2% F-measure with 48.2 FPS on Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500.
arXiv Detail & Related papers (2023-08-14T15:14:37Z)
TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution. It achieves state-of-the-art (SOTA) performance on public benchmark datasets. Our proposed MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning. CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
Stroke-Based Scene Text Erasing Using Synthetic Data [0.0]
Scene text erasing can replace text regions with reasonable content in natural images. The lack of a large-scale real-world scene-text removal dataset prevents existing methods from working at full strength. We enhance and make full use of the synthetic text and consequently train our model only on the dataset generated by the improved synthetic text engine. This model can partially erase text instances in a scene image with a bounding box provided or work with an existing scene text detector for automatic scene text erasing.
arXiv Detail & Related papers (2021-04-23T09:29:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.