Related papers: VinVL: Revisiting Visual Representations in Vision-Language Models

VinVL: Revisiting Visual Representations in Vision-Language Models

URL: http://arxiv.org/abs/2101.00529v2
Date: Wed, 10 Mar 2021 01:27:16 GMT
Title: VinVL: Revisiting Visual Representations in Vision-Language Models
Authors: Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao
Abstract summary: We develop an improved object detection model to provide object-centric representations of images. New visual features significantly improve the performance across all vision language (VL) tasks. We will release the new object detection model to public.
Score: 96.39332942534368
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used \emph{bottom-up and top-down} model \cite{anderson2018bottom}, the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model \oscar \cite{li2020oscar}, and utilize an improved approach \short\ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.

Related papers

Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives [36.297745473653166]
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language.<n>Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress.
arXiv Detail & Related papers (2025-05-20T13:47:40Z)
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
ViTOC: Vision Transformer and Object-aware Captioner [0.0]
ViTOC is a vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions. By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
arXiv Detail & Related papers (2024-11-09T13:13:49Z)
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems [16.49637074299509]
We have explored state-of-the-art vision language models (VLM) for vision-based transportation engineering tasks. The image classification task involves congestion detection and crack identification, whereas, for object detection, helmet violations were identified. We have applied open-source models such as CLIP, BLIP, OWL-ViT, Llava-Next, and closed-source GPT-4o to evaluate the performance of these VLM models.
arXiv Detail & Related papers (2024-09-03T20:24:37Z)
Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object (RVOS) methods typically use vision and language models pretrained independently as backbones. We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction. Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z)
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE) For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC) SVLC includes object attributes, relations, and states which are present in the text and visible in the image. We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling. PEVL reformulates discretized object positions and language in a unified language modeling framework. We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluate explainable vision-language tasks. We also introduce e-SNLI-VE, the largest existing dataset with NLEs. We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.