Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
Transformers
- URL: http://arxiv.org/abs/2004.00849v2
- Date: Mon, 22 Jun 2020 09:09:22 GMT
- Title: Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
Transformers
- Authors: Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
- Abstract summary: We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding.
Our approach achieves state-of-the-art results on downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR)
- Score: 46.275416873403614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Pixel-BERT to align image pixels with text by deep multi-modal
transformers that jointly learn visual and language embeddings in a unified
end-to-end framework. We aim to build a more accurate and thorough connection
between image pixels and language semantics directly from image and sentence
pairs, instead of using region-based image features as in most recent vision
and language methods. Pixel-BERT, which aligns semantics at the pixel and text
level, overcomes the limitation of task-specific visual representations for
vision and language tasks. It also relieves the cost of bounding box
annotations and reduces the imbalance between the semantic labels of visual
tasks and language semantics. To provide a better representation for downstream
tasks, we pre-train a universal end-to-end model with image and sentence pairs
from the Visual Genome and MS-COCO datasets. We propose a random pixel sampling
mechanism to enhance the robustness of the visual representation, and apply
Masked Language Modeling and Image-Text Matching as pre-training tasks.
Extensive experiments with our pre-trained model show that our approach
achieves state-of-the-art results on downstream tasks, including Visual
Question Answering (VQA), image-text retrieval, and Natural Language for Visual
Reasoning for Real (NLVR). In particular, we boost single-model VQA performance
by 2.17 points over the previous state of the art under a fair comparison.
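The pipeline described in the abstract (CNN pixel features, random pixel sampling, and a joint transformer pre-trained with Masked Language Modeling and Image-Text Matching) can be illustrated with a minimal PyTorch-style sketch. The class name, layer sizes, and the ResNet-50 backbone below are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of a Pixel-BERT-style model (illustrative assumptions only).
import torch
import torch.nn as nn
import torchvision.models as tvm


class PixelBERTSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, n_layers=12, n_heads=12):
        super().__init__()
        # CNN backbone that turns raw pixels into a grid of visual features.
        resnet = tvm.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # B x 2048 x H' x W'
        self.visual_proj = nn.Linear(2048, hidden_dim)
        # Text side: token embeddings (a full BERT embedding stack is omitted here).
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        # Joint transformer over [text tokens ; sampled pixel features].
        layer = nn.TransformerEncoderLayer(hidden_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Pre-training heads: Masked Language Modeling and Image-Text Matching.
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)
        self.itm_head = nn.Linear(hidden_dim, 2)

    def forward(self, images, token_ids, n_pixel_samples=100):
        # Pixel features from the CNN, flattened into a token sequence.
        feat = self.backbone(images)                       # B x 2048 x H' x W'
        B, C, H, W = feat.shape
        feat = feat.flatten(2).transpose(1, 2)             # B x (H'*W') x 2048
        # Random pixel sampling: keep a random subset of feature positions
        # during pre-training to improve robustness and reduce computation.
        if self.training and n_pixel_samples < H * W:
            idx = torch.randperm(H * W, device=feat.device)[:n_pixel_samples]
            feat = feat[:, idx, :]
        visual_tokens = self.visual_proj(feat)             # B x S_v x D
        text_tokens = self.token_emb(token_ids)            # B x S_t x D
        joint = torch.cat([text_tokens, visual_tokens], dim=1)
        hidden = self.encoder(joint)
        mlm_logits = self.mlm_head(hidden[:, :token_ids.size(1)])  # per text token
        itm_logits = self.itm_head(hidden[:, 0])           # first text token used as [CLS]
        return mlm_logits, itm_logits
```

In actual pre-training, the MLM logits would be supervised only at masked text positions, and the ITM head would be trained on matched versus mismatched image-sentence pairs.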
Related papers
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We unify a variety of vision tasks into a single image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images of diverse resolutions but also exhibits promising generalization to unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z)
- Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN)
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Understanding Mobile GUI: from Pixel-Words to Screen-Sentences [48.97215653702567]
We propose a mobile GUI understanding architecture: Pixel-Words to Screen-Sentence (PW2SS)
Pixel-Words are defined as atomic visual components, which are visually consistent and semantically clear across screenshots.
We are able to make use of metadata available in training data to auto-generate high-quality annotations for Pixel-Words.
arXiv Detail & Related papers (2021-05-25T13:45:54Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of this objective appears after the list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
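As a rough illustration of the dual-encoder contrastive objective summarized in the last entry above, here is a minimal sketch; the function name, temperature value, and embedding shapes are illustrative assumptions rather than the published training setup.

```python
# Minimal sketch of a dual-encoder image-text contrastive objective
# (symmetric InfoNCE); all names and defaults here are assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: tensors of shape (batch, dim) produced by two
    separate encoders; matching pairs share the same row index.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In practice, image_emb and text_emb would come from separate image and text encoders applied to each pair in a large batch; the paper's claim is that corpus scale compensates for the noise in the alt-text labels.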
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.