Towards Models that Can See and Read
- URL: http://arxiv.org/abs/2301.07389v2
- Date: Tue, 21 Mar 2023 11:40:47 GMT
- Title: Towards Models that Can See and Read
- Authors: Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron
Litman
- Abstract summary: Visual Question Answering (VQA) and Image Captioning (CAP) have analogous scene-text versions that require reasoning from the text in the image.
We propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities.
We show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
- Score: 12.078407046266982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) and Image Captioning (CAP), which are among
the most popular vision-language tasks, have analogous scene-text versions that
require reasoning from the text in the image. Despite their obvious
resemblance, the two are treated independently and, as we show, yield
task-specific methods that can either see or read, but not both. In this work,
we conduct an in-depth analysis of this phenomenon and propose UniTNT, a
Unified Text-Non-Text approach, which grants existing multimodal architectures
scene-text understanding capabilities. Specifically, we treat scene-text
information as an additional modality, fusing it with any pretrained
encoder-decoder-based architecture via designated modules. Thorough experiments
reveal that UniTNT leads to the first single model that successfully handles
both task types. Moreover, we show that scene-text understanding capabilities
can boost vision-language models' performance on general VQA and CAP by up to
2.69% and 0.6 CIDEr, respectively.
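The abstract describes the approach only at a high level. As a rough, hypothetical sketch of what "treating scene-text information as an additional modality, fused with a pretrained encoder-decoder via designated modules" could look like, the snippet below adds a small cross-attention fusion block on top of placeholder encoder outputs. All class names, dimensions, and tensors are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the UniTNT code): fuse scene-text (OCR) token
# features, as an extra modality, into the visual tokens of a pretrained
# encoder-decoder model via a small cross-attention module.
import torch
import torch.nn as nn

class SceneTextFusion(nn.Module):
    """Cross-attention block letting visual tokens attend to OCR tokens."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, ocr_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_v, dim) from the pretrained image encoder
        # ocr_tokens:    (B, N_t, dim) embedded scene-text (OCR) outputs
        fused, _ = self.cross_attn(query=visual_tokens, key=ocr_tokens, value=ocr_tokens)
        return self.norm(visual_tokens + fused)  # residual fusion

# Placeholder tensors stand in for real encoder outputs.
fusion = SceneTextFusion()
visual = torch.randn(2, 196, 768)   # e.g. ViT patch tokens
ocr = torch.randn(2, 32, 768)       # e.g. embedded OCR words and positions
multimodal = fusion(visual, ocr)    # would be passed on to the pretrained decoder
print(multimodal.shape)             # torch.Size([2, 196, 768])
```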
Related papers
- UReader: Universal OCR-free Visually-situated Language Understanding
with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM).
By leveraging the shallow text-recognition ability of the MLLM, we fine-tune only 1.2% of the parameters.
Our single model achieves state-of-the-art OCR-free performance on 8 of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce BEiT-3, a general-purpose multimodal foundation model.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision-language understanding tasks.
We achieve results comparable to the state of the art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)