Refined Vision-Language Modeling for Fine-grained Multi-modal
Pre-training
- URL: http://arxiv.org/abs/2303.05313v2
- Date: Sat, 6 May 2023 15:36:10 GMT
- Title: Refined Vision-Language Modeling for Fine-grained Multi-modal
Pre-training
- Authors: Lisai Zhang, Qingcai Chen, Zhijian Chen, Yunpeng Han, Zhonghua Li,
Zhao Cao
- Abstract summary: Fine-grained supervision based on object annotations has been widely used for vision and language pre-training.
In real-world application scenarios, aligned multi-modal data is usually in the image-caption format, which only provides coarse-grained supervision.
- Score: 12.760340242744313
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-grained supervision based on object annotations has been widely used for
vision and language pre-training (VLP). However, in real-world application
scenarios, aligned multi-modal data is usually in the image-caption format,
which only provides coarse-grained supervision. Collecting object annotations
and building object-annotation pre-extractors for different scenarios is both
costly and computationally expensive. In this paper, we propose a
fine-grained VLP scheme without object annotations from the linguistic
perspective. First, we propose a homonym sentence rewriting (HSR) algorithm to
provide token-level supervision. The algorithm replaces a
verb/noun/adjective/quantifier word of the caption with its homonyms from
WordNet. Correspondingly, we propose a refined vision-language modeling (RVLM)
framework to exploit the token-level supervision. Three refined tasks, i.e.,
refined image-text contrastive (RITC), refined image-text matching (RITM), and
replace language modeling (RLM) are proposed to learn the fine-grained
alignment. Extensive experiments on several downstream tasks demonstrate the
superior performance of the proposed method.
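The abstract only names the HSR procedure, so the following is a minimal, hypothetical sketch of what a WordNet-based rewriting step could look like, using NLTK's WordNet interface. The candidate-selection rule (picking a lemma from a different sense of a polysemous word), the POS mapping, and all helper names are assumptions rather than the authors' algorithm; the returned per-token labels only illustrate the kind of token-level supervision the refined tasks could consume.
```python
# Hypothetical sketch of a WordNet-based caption rewriting step (not the
# authors' exact HSR algorithm). Requires NLTK with the "wordnet",
# "punkt", and "averaged_perceptron_tagger" data packages installed.
import random

import nltk
from nltk.corpus import wordnet as wn

# Penn Treebank tags -> WordNet POS classes named in the abstract
# (verbs, nouns, adjectives; quantifiers are omitted for simplicity).
PTB_TO_WN = {
    "NN": wn.NOUN, "NNS": wn.NOUN,
    "VB": wn.VERB, "VBD": wn.VERB, "VBG": wn.VERB,
    "VBN": wn.VERB, "VBP": wn.VERB, "VBZ": wn.VERB,
    "JJ": wn.ADJ,
}


def rewrite_caption(caption, rng=None):
    """Replace one content word with a lemma drawn from a different
    WordNet sense of that word.

    Returns the rewritten token list plus a 0/1 label per token marking
    the replaced position -- the kind of token-level supervision the
    paper's refined tasks could consume.
    """
    rng = rng or random.Random(0)
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)

    candidates = []
    for idx, (word, tag) in enumerate(tagged):
        pos = PTB_TO_WN.get(tag)
        if pos is None:
            continue
        senses = wn.synsets(word, pos=pos)
        if len(senses) < 2:
            continue  # need at least two senses to pick a different one
        # Lemmas of non-primary senses whose surface form differs from the word.
        alternatives = [
            lemma.name().replace("_", " ")
            for synset in senses[1:]
            for lemma in synset.lemmas()
            if lemma.name().lower() != word.lower()
        ]
        if alternatives:
            candidates.append((idx, alternatives))

    labels = [0] * len(tokens)
    if not candidates:
        return tokens, labels  # nothing replaceable; keep caption unchanged

    idx, alternatives = rng.choice(candidates)
    tokens[idx] = rng.choice(alternatives)
    labels[idx] = 1
    return tokens, labels


if __name__ == "__main__":
    toks, labels = rewrite_caption("A man sits on the bank of a river")
    print(" ".join(toks))
    print(labels)
```
Under this reading, the rewritten caption differs from the original at exactly one position, so every token carries a clean replaced/original label without any object annotation.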
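The abstract lists RITC, RITM, and RLM but does not give their formulations here. The sketch below is a hedged guess, assuming conventional forms: a symmetric InfoNCE loss for RITC, binary matching with rewritten captions as hard negatives for RITM, and per-token replaced/original classification for RLM driven by the labels above. All tensor shapes, names, and loss forms are assumptions, not the paper's definitions.
```python
# Hedged sketch of how the three refined objectives might consume the
# token-level labels; shapes, names, and loss forms are assumptions.
import torch
import torch.nn.functional as F


def refined_losses(image_emb, text_emb, match_logits, match_labels,
                   token_logits, token_labels, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs.
    match_logits: (N, 2) image-text matching scores, where rewritten captions
        appear as hard negatives; match_labels: (N,) with 1 = match, 0 = no match.
    token_logits: (B, L, 2) per-token scores from an RLM-style head.
    token_labels: (B, L) 0/1 replacement labels (-100 marks padding to ignore).
    """
    # RITC (assumed form): symmetric InfoNCE over the in-batch image-text pairs.
    sim = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    ritc = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

    # RITM (assumed form): binary matching, with homonym-rewritten captions
    # serving as hard negatives for their source images.
    ritm = F.cross_entropy(match_logits, match_labels)

    # RLM (assumed form): classify each token as replaced vs. original, using
    # the labels produced during homonym sentence rewriting.
    rlm = F.cross_entropy(token_logits.reshape(-1, 2),
                          token_labels.reshape(-1), ignore_index=-100)

    return ritc + ritm + rlm
```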
Related papers
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins on this task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation [42.01427946204401]
Self-supervised vision-and-language pretraining aims to learn transferable multi-modal representations from large-scale image-text data.
We propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly.
To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.
arXiv Detail & Related papers (2021-09-22T03:38:05Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)