Task Vectors are Cross-Modal
- URL: http://arxiv.org/abs/2410.22330v1
- Date: Tue, 29 Oct 2024 17:59:45 GMT
- Title: Task Vectors are Cross-Modal
- Authors: Grace Luo, Trevor Darrell, Amir Bar
- Abstract summary: We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations.
We consider tasks specified through examples or instructions, using either text or image inputs.
We find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified.
- Score: 58.19152818504624
- License:
- Abstract: We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: https://task-vectors-are-cross-modal.github.io.
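The abstract's mechanism lends itself to a short illustration. The sketch below is an assumption-laden approximation, not the authors' code: it shows one way to (1) read a task vector off the final prompt token at an intermediate layer of a Hugging Face-style decoder VLM, (2) patch that vector back in when the query is specified in another modality, and (3) ensemble exemplar- and instruction-derived vectors by averaging. The layer index, the `model.model.layers` layout, and the processor calls are all assumptions.

```python
# Minimal sketch of task-vector extraction, cross-modal patching, and ensembling,
# assuming a LLaMA-style decoder VLM loaded through Hugging Face transformers.
# The layer index and module layout are illustrative, not the paper's exact setup.
import torch

TASK_LAYER = 15  # assumed "task" layer; the paper locates it empirically


@torch.no_grad()
def task_vector(model, processor, prompts, layer=TASK_LAYER):
    """Mean hidden state of the final prompt token at `layer`, over task examples."""
    states = []
    for text in prompts:
        inputs = processor(text=text, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])  # last token, chosen layer
    return torch.stack(states).mean(dim=0)


def patch_task_vector(model, vec, layer=TASK_LAYER):
    """Register a hook that overwrites the final-position activation at `layer`."""
    block = model.model.layers[layer]  # decoder-block layout is an assumption

    def hook(module, hook_inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = vec.to(hidden.dtype)  # inject the task vector
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

    return block.register_forward_hook(hook)


# Exemplar- and instruction-derived vectors for the same task can be ensembled
# by simple averaging before patching, which the paper reports improves transfer:
# vec = 0.5 * (task_vector(model, proc, text_exemplars)
#              + task_vector(model, proc, [instruction_prompt]))
# handle = patch_task_vector(model, vec)
# ...run one forward pass on an image-specified query, then handle.remove()
```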
Related papers
- Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning [4.676050557609447]
Vision-Language Models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and Visual Question Answering (VQA).
However, their remaining limitations stem from difficulties in effectively integrating multiple modalities and accurately interpreting geometry-related tasks.
We present a promising alternative: task-based prompting, which enriches the prompt with task-specific guidance.
arXiv Detail & Related papers (2024-10-08T11:29:40Z)
- RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z)
- Finding Visual Task Vectors [74.67336516908776]
Visual Prompting is a technique for teaching models to perform a visual task via in-context examples, without any additional training.
We analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find task vectors, activations that encode task-specific information.
arXiv Detail & Related papers (2024-04-08T17:59:46Z)
- Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z)
- Compressed Hierarchical Representations for Multi-Task Learning and Task Clustering [5.878411350387833]
We frame homogeneous-feature multi-task learning as a hierarchical representation learning problem.
We assume an additive independent noise model between the task-agnostic and task-specific latent representations.
Our resulting representations are shown to yield competitive performance on several MTL benchmarks.
arXiv Detail & Related papers (2022-05-31T15:31:17Z)
- Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework.
We pre-train a vision-language joint model, which is multi-task as well.
Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z)
- Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study [75.42182503265056]
Multi-Task Learning (MTL) has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm.
We deal with heterogeneous MTL, simultaneously addressing detection, classification, and regression problems.
We build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks.
arXiv Detail & Related papers (2021-05-08T22:26:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.