TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating
Visio-Linguistic Reasoning
- URL: http://arxiv.org/abs/2111.10756v3
- Date: Sat, 15 Apr 2023 09:48:44 GMT
- Title: TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating
Visio-Linguistic Reasoning
- Authors: Keng Ji Chow, Samson Tan, Min-Yen Kan
- Abstract summary: We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks.
Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information.
We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
- Score: 25.520406167426135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous visio-linguistic (V+L) representation learning methods have been
developed, yet existing datasets do not adequately evaluate the extent to which
they represent visual and linguistic concepts in a unified space. We propose
several novel evaluation settings for V+L models, including cross-modal
transfer. Furthermore, existing V+L benchmarks often report global accuracy
scores on the entire dataset, making it difficult to pinpoint the specific
reasoning tasks that models fail and succeed at. We present TraVLR, a synthetic
dataset comprising four V+L reasoning tasks. TraVLR's synthetic nature allows
us to constrain its training and testing distributions along task-relevant
dimensions, enabling the evaluation of out-of-distribution generalisation. Each
example in TraVLR redundantly encodes the scene in two modalities, allowing
either to be dropped or added during training or testing without losing
relevant information. We compare the performance of four state-of-the-art V+L
models, finding that while they perform well on test examples from the same
modality, they all fail at cross-modal transfer and have limited success
accommodating the addition or deletion of one modality. We release TraVLR as an
open challenge for the research community.
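The cross-modal transfer and modality-dropping settings described in the abstract can be made concrete with a small sketch. The snippet below is illustrative only: it assumes a hypothetical `TraVLRExample` record carrying the redundant image and text encodings of a scene, and a generic `model.predict` interface; none of these names come from the released dataset or the evaluated models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TraVLRExample:
    """Hypothetical record: each TraVLR scene is redundantly encoded in
    both modalities, so either one can be dropped without losing the
    information needed to solve the task."""
    image: object   # visual rendering of the scene
    caption: str    # textual description of the same scene
    label: int      # task label (e.g., whether a statement holds)

def evaluate(model, examples: List[TraVLRExample], modality: str) -> float:
    """Accuracy when the model sees only the given modality.

    `model.predict` is a stand-in for whatever inference call a given
    V+L model exposes; dropping a modality is shown here simply as
    passing None for it.
    """
    correct = 0
    for ex in examples:
        image = ex.image if modality in ("image", "both") else None
        text = ex.caption if modality in ("text", "both") else None
        pred = model.predict(image=image, text=text)
        correct += int(pred == ex.label)
    return correct / len(examples)

def cross_modal_transfer_gap(model, test_set: List[TraVLRExample]) -> float:
    """Assuming the model was trained with text-only inputs, the gap
    between same-modality and cross-modal test accuracy is the quantity
    of interest in the cross-modal transfer setting."""
    same_modality = evaluate(model, test_set, modality="text")
    cross_modal = evaluate(model, test_set, modality="image")
    return same_modality - cross_modal
```

A model that has learned a genuinely unified representation of the two modalities should show a small gap under this protocol; the paper reports that the four evaluated models do not.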
Related papers
- VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection [5.66050466694651]
We propose incorporating Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness.
We also propose a new scoring function that enables data- and training-free outlier supervision via textual prompts.
The resulting VL4AD model achieves competitive performance on widely used benchmark datasets.
arXiv Detail & Related papers (2024-09-25T20:12:10Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - The All-Seeing Project V2: Towards General Relation Comprehension of the Open World [58.40101895719467]
We present the All-Seeing Project V2, a new model and dataset designed for understanding object relations in images.
We propose the All-Seeing Model V2 that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation task.
Our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them.
arXiv Detail & Related papers (2024-02-29T18:59:17Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - Going Beyond Nouns With Vision & Language Models Using Synthetic Data [43.87754926411406]
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications.
Recent works have uncovered a fundamental weakness of these models.
We investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings.
arXiv Detail & Related papers (2023-03-30T17:57:43Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in
Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with natural language explanations (NLEs).
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z) - Seeing past words: Testing the cross-modal capabilities of pretrained
V&L models [18.73444918172383]
We investigate the ability of general-purpose pretrained vision-and-language (V&L) models to perform reasoning in two tasks that require multimodal integration.
We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT.
Our investigations suggest that pretrained V&L representations are less successful than expected at integrating the two modalities.
arXiv Detail & Related papers (2020-12-22T21:01:44Z)