DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps
- URL: http://arxiv.org/abs/2302.01540v1
- Date: Fri, 3 Feb 2023 04:31:13 GMT
- Title: DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps
- Authors: Dongsheng Xu, Qingbao Huang, Yi Cai
- Abstract summary: We propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps.
Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities.
- Score: 10.87327544629769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based image captioning is an important but under-explored task, aiming
to generate descriptions containing visual objects and scene text. Recent
studies have made encouraging progress, but they are still suffering from a
lack of overall understanding of scenes and generating inaccurate captions. One
possible reason is that current studies mainly focus on constructing the
plane-level geometric relationship of scene text without depth information.
This leads to insufficient scene text relational reasoning so that models may
describe scene text inaccurately. The other possible reason is that existing
methods fail to generate fine-grained descriptions of some visual objects. In
addition, they may ignore essential visual objects, leading to the scene text
belonging to these ignored objects not being utilized. To address the above
issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for
TextCaps. Concretely, to construct three-dimensional geometric relations, we
introduce depth information and propose a depth-enhanced feature updating
module to ameliorate OCR token features. To generate more precise and
comprehensive captions, we introduce semantic features of detected visual
object concepts as auxiliary information. Our DEVICE is capable of generalizing
scenes more comprehensively and boosting the accuracy of described visual
entities. Sufficient experiments demonstrate the effectiveness of our proposed
DEVICE, which outperforms state-of-the-art models on the TextCaps test set. Our
code will be publicly available.
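The abstract does not spell out how the depth-enhanced feature updating module or the visual-concept features are implemented, so the snippet below is only a minimal PyTorch sketch of the general idea it describes: projecting a per-token depth value into the OCR feature space, using it to update the OCR token features, and appending embeddings of detected visual concepts as auxiliary tokens. All names, dimensions, and the gated-fusion scheme here are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class DepthEnhancedOCRUpdater(nn.Module):
    """Hypothetical sketch: fuse per-token depth into OCR token features.

    The real DEVICE module is not specified in the abstract; this only
    illustrates one plausible way to condition OCR features on depth.
    """

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.depth_proj = nn.Linear(1, feat_dim)   # scalar depth -> feature space
        self.gate = nn.Sequential(                 # gated residual update
            nn.Linear(2 * feat_dim, feat_dim),
            nn.Sigmoid(),
        )
        self.update = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, ocr_feats: torch.Tensor, ocr_depth: torch.Tensor) -> torch.Tensor:
        # ocr_feats: (batch, num_ocr, feat_dim); ocr_depth: (batch, num_ocr, 1)
        d = self.depth_proj(ocr_depth)
        fused = torch.cat([ocr_feats, d], dim=-1)
        g = self.gate(fused)
        return ocr_feats + g * torch.tanh(self.update(fused))


# Visual-concept semantics as auxiliary tokens: embed detected concept labels
# and concatenate them with the depth-updated OCR tokens before feeding the
# sequence to the captioning transformer (again, an assumed arrangement).
if __name__ == "__main__":
    batch, num_ocr, dim = 2, 10, 768
    updater = DepthEnhancedOCRUpdater(dim)
    ocr_feats = torch.randn(batch, num_ocr, dim)
    ocr_depth = torch.rand(batch, num_ocr, 1)      # normalized depth per OCR token
    concept_emb = torch.randn(batch, 5, dim)       # embeddings of detected concepts
    tokens = torch.cat([updater(ocr_feats, ocr_depth), concept_emb], dim=1)
    print(tokens.shape)                            # torch.Size([2, 15, 768])
```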
Related papers
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- Explore and Tell: Embodied Visual Captioning in 3D Environments [83.00553567094998]
In real-world scenarios, a single image may not offer a good viewpoint, hindering fine-grained scene understanding.
We propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities.
We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task.
arXiv Detail & Related papers (2023-08-21T03:46:04Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
- Text Gestalt: Stroke-Aware Scene Text Image Super-Resolution [31.88960656995447]
We propose a Stroke-Aware Scene Text Image Super-Resolution method containing a Stroke-Focused Module (SFM) to concentrate on stroke-level internal structures of characters in text images.
Specifically, we attempt to design rules for decomposing English characters and digits at the stroke level, then pre-train a text recognizer to provide stroke-level attention maps as positional clues.
The proposed method generates more distinguishable images on TextZoom and the manually constructed Chinese character dataset Degraded-IC13.
arXiv Detail & Related papers (2021-12-13T15:26:10Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting: generating a realistic image from objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
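The TextCaps entry above notes that a model must decide whether to copy scene text or generate a vocabulary word; captioners for this task typically handle that with a pointer-style output head that scores vocabulary words and the image's OCR tokens in one softmax. The following is a minimal, hypothetical sketch of that scoring step, not any specific paper's implementation; the class name and projection layers are assumptions for illustration.

```python
import torch
import torch.nn as nn


class CopyOrGenerateHead(nn.Module):
    """Hypothetical sketch of a pointer-style output layer: at each decoding
    step the decoder state scores fixed vocabulary words and the image's OCR
    tokens jointly, so the model can either generate a word or copy scene text."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_scores = nn.Linear(hidden_dim, vocab_size)
        self.query = nn.Linear(hidden_dim, hidden_dim)   # decoder state -> query
        self.key = nn.Linear(hidden_dim, hidden_dim)     # OCR feature -> key

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (batch, hidden); ocr_feats: (batch, num_ocr, hidden)
        vocab_logits = self.vocab_scores(dec_state)            # (batch, vocab)
        q = self.query(dec_state).unsqueeze(1)                 # (batch, 1, hidden)
        copy_logits = (q * self.key(ocr_feats)).sum(-1)        # (batch, num_ocr)
        # Joint space: the highest-scoring index is either a vocabulary word
        # or one of the image's OCR tokens to be copied into the caption.
        return torch.cat([vocab_logits, copy_logits], dim=-1)
```

Because the copy scores are computed against the (depth- or context-enhanced) OCR features, improvements to those features directly affect which scene-text tokens get copied into the caption.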
This list is automatically generated from the titles and abstracts of the papers on this site.