DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
- URL: http://arxiv.org/abs/2302.01540v4
- Date: Sat, 26 Apr 2025 17:48:58 GMT
- Title: DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
- Authors: Dongsheng Xu, Qingbao Huang, Xingmao Zhang, Haonan Cheng, Feng Shuang, Yi Cai
- Abstract summary: OCR-based image captioning is an important but under-explored task. Recent studies have made encouraging progress, but they still lack an overall understanding of scenes. We propose a Depth and Visual Concepts Aware Transformer (DEVICE) for OCR-based image captioning.
- Score: 12.786190263272859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: OCR-based image captioning is an important but under-explored task that aims to generate descriptions containing both visual objects and scene text. Recent studies have made encouraging progress, but they still lack an overall understanding of scenes and generate inaccurate captions. One possible reason is that current studies mainly focus on constructing plane-level geometric relationships among scene text without depth information. This leads to insufficient relational reasoning over scene text, so models may describe it inaccurately. Another possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects, so the scene text belonging to those ignored objects goes unused. To address these issues, we propose a Depth and Visual Concepts Aware Transformer (DEVICE) for OCR-based image captioning. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce semantic features of detected visual concepts as auxiliary information and propose a semantic-guided alignment module to improve the model's ability to utilize visual concepts. Our DEVICE is capable of understanding scenes more comprehensively and boosting the accuracy of described visual entities. Extensive experiments demonstrate the effectiveness of the proposed DEVICE, which outperforms state-of-the-art models on the TextCaps test set.
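The abstract names two components but not how they are realized. Below is a minimal PyTorch sketch of the first idea only, depth-enhanced updating of OCR token features, assuming per-token depth values pooled from a monocular depth map; the module name, dimensions, and fusion scheme are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DepthEnhancedOCRFeatures(nn.Module):
    """Illustrative sketch (not the authors' code): fuse a per-token depth value
    into 2D OCR token features so downstream attention can reason about
    three-dimensional geometric relations between scene-text tokens."""

    def __init__(self, ocr_dim: int = 768, depth_dim: int = 32):
        super().__init__()
        # Hypothetical choices: embed each token's scalar depth, concatenate it
        # with the token feature, project back, then refine with self-attention.
        self.depth_embed = nn.Sequential(nn.Linear(1, depth_dim), nn.ReLU())
        self.fuse = nn.Linear(ocr_dim + depth_dim, ocr_dim)
        self.refine = nn.TransformerEncoderLayer(d_model=ocr_dim, nhead=8, batch_first=True)

    def forward(self, ocr_feats: torch.Tensor, token_depth: torch.Tensor) -> torch.Tensor:
        # ocr_feats:   (batch, num_tokens, ocr_dim)  2D appearance/layout features
        # token_depth: (batch, num_tokens, 1)        e.g. mean monocular depth of each token's box
        fused = self.fuse(torch.cat([ocr_feats, self.depth_embed(token_depth)], dim=-1))
        return self.refine(fused)  # depth-aware updated OCR token features


if __name__ == "__main__":
    module = DepthEnhancedOCRFeatures()
    feats = torch.randn(2, 10, 768)    # 10 OCR tokens per image
    depth = torch.rand(2, 10, 1)       # normalized depth in [0, 1]
    print(module(feats, depth).shape)  # torch.Size([2, 10, 768])
```

The semantic-guided alignment module could be sketched analogously as cross-attention between detected visual-concept embeddings and the caption decoder's hidden states; consult the paper for the actual design.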
Related papers
- Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception [42.432917056115166]
Training Large Multimodality Models relies on descriptive image captions that connect images and language.
We propose to leverage off-the-shelf visual specialists, originally trained on annotated images for tasks other than image captioning, to enhance image captions.
arXiv Detail & Related papers (2024-12-18T18:45:43Z) - VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z) - Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
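PreSTU's blurb describes its objective only at a high level; the toy sketch below illustrates one way such an OCR-aware objective can be framed, namely training a captioner to generate the scene text returned by an off-the-shelf OCR system before the rest of the target text. The architecture, dimensions, and data layout here are assumptions for illustration, not PreSTU's actual recipe.

```python
import torch
import torch.nn as nn

class ToyOCRAwarePretrainer(nn.Module):
    """Toy stand-in (not PreSTU): the decoder generates [OCR tokens] + [SEP] + [caption],
    so the model must learn to read scene text and tie it to the image content."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 256, feat_dim: int = 2048):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, d_model)   # assumes precomputed region features
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        memory = self.visual_proj(region_feats)
        tgt = self.token_embed(target_ids)
        L = target_ids.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # no peeking ahead
        return self.lm_head(self.decoder(tgt, memory, tgt_mask=causal))

model = ToyOCRAwarePretrainer()
feats = torch.randn(2, 36, 2048)           # 36 region features per image (assumed)
targets = torch.randint(0, 1000, (2, 20))  # OCR-prefixed target sequence (teacher forcing)
logits = model(feats, targets[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), targets[:, 1:].reshape(-1))
print(loss.item())
```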
arXiv Detail & Related papers (2022-09-12T18:29:55Z) - Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Text Gestalt: Stroke-Aware Scene Text Image Super-Resolution [31.88960656995447]
We propose a Stroke-Aware Scene Text Image Super-Resolution method containing a Stroke-Focused Module (SFM) to concentrate on stroke-level internal structures of characters in text images.
Specifically, we design rules for decomposing English characters and digits at the stroke level, and then pre-train a text recognizer to provide stroke-level attention maps as positional clues.
The proposed method generates more distinguishable images on TextZoom and on the manually constructed Chinese character dataset Degraded-IC13.
arXiv Detail & Related papers (2021-12-13T15:26:10Z) - Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z) - TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)