VinVL: Revisiting Visual Representations in Vision-Language Models
- URL: http://arxiv.org/abs/2101.00529v2
- Date: Wed, 10 Mar 2021 01:27:16 GMT
- Title: VinVL: Revisiting Visual Representations in Vision-Language Models
- Authors: Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang,
Lijuan Wang, Yejin Choi, Jianfeng Gao
- Abstract summary: We develop an improved object detection model to provide object-centric representations of images.
New visual features significantly improve the performance across all vision language (VL) tasks.
We will release the new object detection model to the public.
- Score: 96.39332942534368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a detailed study of improving visual representations for
vision language (VL) tasks and develops an improved object detection model to
provide object-centric representations of images. Compared to the most widely
used bottom-up and top-down model (Anderson et al., 2018), the new
model is bigger, better designed for VL tasks, and pre-trained on much larger
training corpora that combine multiple public annotated object detection
datasets. Therefore, it can generate representations of a richer collection of
visual objects and concepts. While previous VL research focuses mainly on
improving the vision-language fusion model and leaves the object detection
model improvement untouched, we show that visual features matter significantly
in VL models. In our experiments, we feed the visual features generated by the
new object detection model into a Transformer-based VL fusion model, OSCAR
(Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the
VL model and fine-tune it on a wide range of downstream VL tasks. Our results
show that the new visual features significantly improve the performance across
all VL tasks, creating new state-of-the-art results on seven public benchmarks.
We will release the new object detection model to the public.
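The pipeline the abstract describes is two-stage: an object detector produces region-level (object-centric) features, which are then combined with text tokens inside a Transformer fusion model (OSCAR/OSCAR+) for pre-training and fine-tuning. Below is a minimal, illustrative sketch of that fusion step in PyTorch; it is not the authors' released code, and the module names, the 2054-d region-feature size, and the layer sizes are assumptions chosen to mirror common OSCAR-style setups.

```python
import torch
import torch.nn as nn

class VLFusionSketch(nn.Module):
    """Hypothetical OSCAR-style fusion: text tokens and detector region
    features are encoded as one multimodal sequence by a single Transformer."""

    def __init__(self, vocab_size=30522, hidden=768, region_dim=2054, layers=12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Project detector features (pooled region feature + box geometry) to the hidden size.
        self.region_proj = nn.Linear(region_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) caption/question (and object-tag) token ids
        # region_feats: (B, R, region_dim) object-centric features from the detector
        text = self.text_embed(token_ids)
        regions = self.region_proj(region_feats)
        fused = torch.cat([text, regions], dim=1)  # one multimodal sequence
        return self.encoder(fused)                 # contextualized VL representations

# Example usage with random inputs:
model = VLFusionSketch()
tokens = torch.randint(0, 30522, (2, 20))   # 2 captions, 20 tokens each
regions = torch.randn(2, 36, 2054)          # 36 detected regions per image
out = model(tokens, regions)                # shape: (2, 20 + 36, 768)
```

VinVL's central claim is that improving the detector that produces the region features (a bigger backbone, larger combined detection corpora, a richer object and attribute vocabulary) lifts performance across VL tasks even when the fusion model itself is changed comparatively little.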
Related papers
- ViTOC: Vision Transformer and Object-aware Captioner [0.0]
ViTOC is a vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions.
By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
arXiv Detail & Related papers (2024-11-09T13:13:49Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
- Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems [16.49637074299509]
We have explored state-of-the-art vision language models (VLMs) for vision-based transportation engineering tasks.
The image classification task involves congestion detection and crack identification, whereas the object detection task involves identifying helmet violations.
We have applied open-source models such as CLIP, BLIP, OWL-ViT, and Llava-Next, as well as the closed-source GPT-4o, and evaluated their performance on these tasks.
arXiv Detail & Related papers (2024-09-03T20:24:37Z)
- Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object Segmentation (RVOS) methods typically use vision and language models pretrained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction.
Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z)
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.