Related papers: To show or not to show: Redacting sensitive text from videos of electronic displays

To show or not to show: Redacting sensitive text from videos of electronic displays

URL: http://arxiv.org/abs/2208.10270v1
Date: Fri, 19 Aug 2022 07:53:04 GMT
Title: To show or not to show: Redacting sensitive text from videos of electronic displays
Authors: Abhishek Mukhopadhyay, Shubham Agarwal, Patrick Dylan Zwick, and Pradipta Biswas
Abstract summary: We define an approach for redacting personally identifiable text from videos using a combination of optical character recognition (OCR) and natural language processing (NLP) techniques. We examine the relative performance of this approach when used with different OCR models, specifically Tesseract and the OCR system from Google Cloud Vision (GCV)
Score: 4.621328863799446
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the increasing prevalence of video recordings there is a growing need for tools that can maintain the privacy of those recorded. In this paper, we define an approach for redacting personally identifiable text from videos using a combination of optical character recognition (OCR) and natural language processing (NLP) techniques. We examine the relative performance of this approach when used with different OCR models, specifically Tesseract and the OCR system from Google Cloud Vision (GCV). For the proposed approach the performance of GCV, in both accuracy and speed, is significantly higher than Tesseract. Finally, we explore the advantages and disadvantages of both models in real-world applications.

Related papers

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments [3.5936169218390703]
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements.
arXiv Detail & Related papers (2025-02-10T13:20:19Z)
Do Current Video LLMs Have Strong OCR Abilities? A Preliminary Study [5.667343827196717]
This paper introduces a novel benchmark designed to evaluate the video OCR performance of multi-modal models in videos.<n>We developed this benchmark using a semi-automated approach that integrates the OCR ability of image LLMs with manual refinement, balancing efficiency, cost, and data quality.
arXiv Detail & Related papers (2024-12-29T23:20:01Z)
See then Tell: Enhancing Key Information Extraction with Vision Grounding [54.061203106565706]
We introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets.
arXiv Detail & Related papers (2024-09-29T06:21:05Z)
UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model. We show that UNIT significantly outperforms existing methods on document-related tasks. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. We introduce a novel method named Decoder Pre-training with only text for STR (DPTR) DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding [18.609441902943445]
VisFocus is an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance.
arXiv Detail & Related papers (2024-07-17T14:16:46Z)
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA) We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi- fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models. We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
Text-Conditioned Resampler For Long Form Video Understanding [94.81955667020867]
We present a text-conditioned video resampler (TCR) module that uses a pre-trained visual encoder and large language model (LLM) TCR can process more than 100 frames at a time with plain attention and without optimised implementations.
arXiv Detail & Related papers (2023-12-19T06:42:47Z)
Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents [0.8158530638728501]
This paper evaluates the impact of image processing methods and parameter tuning in Optical Character Recognition (OCR) The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) Our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results.
arXiv Detail & Related papers (2023-11-27T11:44:46Z)
Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge. We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images. We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
Adaptive Compact Attention For Few-shot Video-to-video Translation [13.535988102579918]
We introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images. Our core idea is to extract compact basis sets from all the reference images as higher-level representations. We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset.
arXiv Detail & Related papers (2020-11-30T11:19:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.