Exposing and Addressing Cross-Task Inconsistency in Unified
Vision-Language Models
- URL: http://arxiv.org/abs/2303.16133v2
- Date: Wed, 21 Feb 2024 22:53:33 GMT
- Title: Exposing and Addressing Cross-Task Inconsistency in Unified
Vision-Language Models
- Authors: Adyasha Maharana, Amita Kamath, Christopher Clark, Mohit Bansal,
Aniruddha Kembhavi
- Abstract summary: Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
- Score: 80.23791222509644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As general purpose vision models get increasingly effective at a wide set of
tasks, it is imperative that they be consistent across the tasks they support.
Inconsistent AI models are considered brittle and untrustworthy by human users
and are more challenging to incorporate into larger systems that take
dependencies on their outputs. Measuring consistency between very heterogeneous
tasks that might include outputs in different modalities is challenging since
it is difficult to determine if the predictions are consistent with one
another. As a solution, we introduce a benchmark dataset, CocoCon, where we
create contrast sets by modifying test instances for multiple tasks in small
but semantically meaningful ways to change the gold label, and outline metrics
for measuring if a model is consistent by ranking the original and perturbed
instances across tasks. We find that state-of-the-art vision-language models
suffer from a surprisingly high degree of inconsistent behavior across tasks,
especially for more heterogeneous tasks. To alleviate this issue, we propose a
rank correlation-based auxiliary training objective, computed over large
automatically created cross-task contrast sets, that improves the multi-task
consistency of large unified models while retaining their original accuracy on
downstream tasks.
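Both the CocoCon consistency metric and the auxiliary objective reduce to comparing how different task heads rank an original instance against its perturbed counterpart. The sketch below is a minimal illustration of that idea, assuming the model exposes a per-task scalar score such as the log-likelihood of the gold output; the scorer interface, the Pearson surrogate for rank correlation, and all names are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of ranking-based cross-task consistency
# over contrast sets, plus a differentiable rank-correlation-style auxiliary loss.

from itertools import combinations
from typing import Callable, Dict, List

import torch

Score = Callable[[dict], float]  # task-specific scorer: instance -> log-likelihood


def cross_task_consistency(
    originals: List[dict],
    perturbed: List[dict],
    scorers: Dict[str, Score],
) -> float:
    """Fraction of (contrast pair, task pair) combinations on which two tasks
    agree about whether the original or the perturbed instance scores higher."""
    agree, total = 0, 0
    for orig, pert in zip(originals, perturbed):
        # +1 if a task prefers the original, -1 if it prefers the perturbation
        prefs = {t: 1 if s(orig) >= s(pert) else -1 for t, s in scorers.items()}
        for t1, t2 in combinations(prefs, 2):
            agree += int(prefs[t1] == prefs[t2])
            total += 1
    return agree / max(total, 1)


def rank_consistency_loss(scores_a: torch.Tensor, scores_b: torch.Tensor) -> torch.Tensor:
    """Soft auxiliary loss pushing two tasks to order contrast pairs the same way.
    scores_*: shape (batch, 2) with columns [original, perturbed]."""
    margin_a = scores_a[:, 0] - scores_a[:, 1]
    margin_b = scores_b[:, 0] - scores_b[:, 1]
    # Pearson correlation of the score margins, used here as a differentiable
    # stand-in for a rank correlation such as Spearman's rho
    a = margin_a - margin_a.mean()
    b = margin_b - margin_b.mean()
    corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return 1.0 - corr
```

In training, such a loss would be added to the regular task losses with a small weight, computed over batches drawn from the automatically created cross-task contrast sets.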
Related papers
- Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning [46.51245338355645]
Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model.
We show that this single-input-multiple-output (SIMO) setting qualitatively differs from the model merging settings studied in the literature due to the existence of task-specific decoders.
We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging.
arXiv Detail & Related papers (2025-04-15T15:10:46Z)
- Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [74.02034188307857]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data.
We find existing methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance.
Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z)
- Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics [37.86612817818566]
General purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics.
We introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks with a total of 25 datasets.
Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks.
arXiv Detail & Related papers (2024-12-13T22:38:09Z)
- Revisiting Weight Averaging for Model Merging [16.503826062785773]
Model merging aims to build a multi-task learner by combining the parameters of individually fine-tuned models without additional training.
Weight averaging implicitly induces task vectors centered around the weight average itself.
Applying a low-rank approximation to these centered task vectors significantly improves merging performance (see the sketch after this list).
arXiv Detail & Related papers (2024-12-11T06:29:20Z)
- Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging [33.23758947497205]
Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks.
To overcome these challenges, we explore model merging, a technique that combines independently trained models to mitigate gradient conflicts and balance data distribution.
We introduce a novel method, Self Positioning, which efficiently searches for optimal model combinations within the space of task vectors using gradient descent.
arXiv Detail & Related papers (2024-10-19T08:39:21Z)
- Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
- Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
arXiv Detail & Related papers (2024-03-01T07:06:57Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z)
- Robust Learning Through Cross-Task Consistency [92.42534246652062]
We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency.
We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs.
arXiv Detail & Related papers (2020-06-07T09:24:33Z)
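As a side note on the Revisiting Weight Averaging entry above, the centered-task-vector idea can be sketched in a few lines: average the checkpoints, subtract the average from each fine-tuned model to obtain a centered task vector, and keep only a low-rank approximation of it. Everything below (the rank, the use of a plain truncated SVD, and the per-task reconstruction) is an illustrative assumption, not that paper's actual procedure.

```python
# Illustrative sketch of weight averaging with low-rank, centered task vectors
# (assumptions only; see the Revisiting Weight Averaging paper for the real method).

from typing import Dict, List

import numpy as np


def low_rank(mat: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-r approximation of a 2-D weight matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    r = min(rank, len(s))
    return (u[:, :r] * s[:r]) @ vt[:r]


def merge_with_centered_task_vectors(
    checkpoints: List[Dict[str, np.ndarray]], rank: int = 8, alpha: float = 1.0
) -> List[Dict[str, np.ndarray]]:
    """Return one set of weights per task: the weight average plus a low-rank
    approximation of that task's centered task vector (checkpoint minus average)."""
    names = checkpoints[0].keys()
    avg = {n: np.mean([c[n] for c in checkpoints], axis=0) for n in names}
    merged = []
    for ckpt in checkpoints:
        out = {}
        for n in names:
            delta = ckpt[n] - avg[n]
            if delta.ndim == 2:  # apply the low-rank step to matrix weights only
                delta = low_rank(delta, rank)
            out[n] = avg[n] + alpha * delta
        merged.append(out)
    return merged
```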