Evaluating the Capabilities of Multi-modal Reasoning Models with
Synthetic Task Data
- URL: http://arxiv.org/abs/2306.01144v1
- Date: Thu, 1 Jun 2023 20:56:34 GMT
- Title: Evaluating the Capabilities of Multi-modal Reasoning Models with
Synthetic Task Data
- Authors: Nathan Vaska, Victoria Helus
- Abstract summary: We leverage advances in high resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks.
We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task.
We demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive advances and applications of large language and joint
language-and-visual understanding models have led to an increased need for
methods of probing their potential reasoning capabilities. However, the
difficulty of gathering naturally-occurring data for complex multi-modal reasoning
tasks bottlenecks the evaluation of AI methods on tasks which are not already
covered by an academic dataset. In this work, we leverage recent advances in
high resolution text-to-image generation to develop a framework for generating
evaluation data for multi-modal reasoning tasks. We apply this framework to
generate context-dependent anomaly data, creating a synthetic dataset on a
challenging task which is not well covered by existing datasets. We benchmark
the performance of a state-of-the-art visual question answering (VQA) model
against data generated with this method, and demonstrate that while the task is
tractable, the model performs significantly worse on the context-dependent
anomaly detection task than on standard VQA tasks.
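To make the evaluation recipe concrete, here is a minimal sketch of the two stages the abstract describes: rendering a context-dependent anomaly from a text prompt, then querying a VQA model about it. The specific checkpoints (Stable Diffusion via the diffusers library, BLIP VQA via transformers), the prompt, and the question wording are illustrative assumptions, not the exact setup used in the paper.

    import torch
    from diffusers import StableDiffusionPipeline
    from transformers import BlipProcessor, BlipForQuestionAnswering

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stage 1: synthesize a context-dependent anomaly image from a text prompt.
    # Checkpoint and prompt are illustrative assumptions.
    t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
    prompt = "a photo of a toaster sitting inside an open refrigerator"
    image = t2i(prompt).images[0]

    # Stage 2: ask a VQA model whether anything is out of place in the scene.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    vqa = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
    question = "Is there an object in this image that does not belong in its surroundings?"
    inputs = processor(image, question, return_tensors="pt").to(device)
    answer_ids = vqa.generate(**inputs)
    print(processor.decode(answer_ids[0], skip_special_tokens=True))

Repeating this over many prompt/image pairs, with each prompt supplying the expected answer, yields the kind of accuracy comparison between standard VQA questions and context-dependent anomaly questions that the abstract reports.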
Related papers
- Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z)
- Improving QA Model Performance with Cartographic Inoculation [0.0]
"Dataset artifacts" reduce the model's ability to generalize to real-world QA problems.
We analyze the impacts and incidence of dataset artifacts using an adversarial challenge set.
We show that by selectively fine-tuning a model on ambiguous adversarial examples from a challenge set, significant performance improvements can be made.
arXiv Detail & Related papers (2024-01-30T23:08:26Z)
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers [54.83459025465947]
Even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting.
Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools.
We present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples.
arXiv Detail & Related papers (2024-01-03T20:48:47Z)
- Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbations and four types of mixed perturbations.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
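As a rough illustration of this style of robustness testing (the five single and four mixed perturbation types in Noise-LLM are defined in the paper and are not reproduced here), the sketch below applies a generic character-level typo perturbation to a slot-filling utterance before it is handed to the model under evaluation.

    import random

    def typo_perturb(utterance: str, rate: float = 0.1, seed: int = 0) -> str:
        # Swap adjacent letters at random to simulate keyboard noise.
        # A generic perturbation for illustration, not a Noise-LLM category.
        rng = random.Random(seed)
        chars = list(utterance)
        for i in range(len(chars) - 1):
            if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    clean = "book a flight from boston to denver next friday"
    noisy = typo_perturb(clean)
    print(noisy)
    # Both utterances go to the same slot-filling model; the drop in slot F1
    # from clean to noisy input measures robustness to this kind of noise.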
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
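A hedged sketch of the consistency signal behind such an objective: Spearman rank correlation between the scores two task heads assign to the same contrast set. The scores below are placeholder values, and the paper's training objective is a differentiable formulation rather than this post-hoc check.

    from scipy.stats import spearmanr

    # Hypothetical scores from two task heads on the same ranked contrast set;
    # the values are placeholders for illustration only.
    task_a_scores = [0.91, 0.75, 0.62, 0.40, 0.22]
    task_b_scores = [0.88, 0.51, 0.70, 0.35, 0.30]

    rho, p_value = spearmanr(task_a_scores, task_b_scores)
    print(f"cross-task rank correlation: {rho:.3f} (p={p_value:.3f})")
    # Low correlation means the two heads rank the same examples inconsistently.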
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
- GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
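As a generic illustration of learning under constraints (not GLUECons's actual implementation), external knowledge can enter training as a soft penalty added to the task loss; the mutual-exclusivity constraint and the weight below are hypothetical.

    import torch

    def constrained_loss(task_loss: torch.Tensor, probs: torch.Tensor,
                         weight: float = 0.5) -> torch.Tensor:
        # probs[:, 0] and probs[:, 1] are predicted probabilities of two labels
        # that external knowledge says cannot both hold; penalize violations.
        violation = torch.relu(probs[:, 0] + probs[:, 1] - 1.0)
        return task_loss + weight * violation.mean()

    # Toy usage with placeholder values.
    task_loss = torch.tensor(0.42)
    probs = torch.tensor([[0.9, 0.8], [0.2, 0.1]])
    print(constrained_loss(task_loss, probs))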
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
- Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations [12.757277574843101]
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks.
Current models offer impressive performance on task accuracy and explanation plausibility, but suffer from a range of issues.
We apply recent advances in large-scale multi-task pretraining of generative Transformer models to the problem of VL-NLE tasks.
Our approach outperforms recent models by a large margin, with human annotators preferring the generated explanations over the ground truth in two out of three evaluated datasets.
arXiv Detail & Related papers (2022-12-08T12:28:23Z)
- Eliminating Catastrophic Interference with Biased Competition [0.0]
We present a model that takes advantage of the multi-task nature of complex datasets by learning to separate tasks and subtasks in an end-to-end manner by biasing competitive interactions in the network.
We demonstrate that this model eliminates catastrophic interference between tasks on a newly created dataset and provides competitive results in the Visual Question Answering space.
arXiv Detail & Related papers (2020-07-03T16:15:15Z)
- DQI: Measuring Data Quality in NLP [22.54066527822898]
We introduce a generic formula for Data Quality Index (DQI) to help dataset creators create datasets free of unwanted biases.
We show that models trained on the renovated SNLI dataset generalize better to out-of-distribution tasks.
arXiv Detail & Related papers (2020-05-02T12:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.