One Embedder, Any Task: Instruction-Finetuned Text Embeddings
- URL: http://arxiv.org/abs/2212.09741v3
- Date: Tue, 30 May 2023 15:22:50 GMT
- Title: One Embedder, Any Task: Instruction-Finetuned Text Embeddings
- Authors: Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari
Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu
- Abstract summary: INSTRUCTOR is a new method for computing text embeddings given task instructions.
Every text input is embedded together with instructions explaining the use case.
We evaluate INSTRUCTOR on 70 embedding evaluation tasks.
- Score: 105.82772523968961
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce INSTRUCTOR, a new method for computing text embeddings given
task instructions: every text input is embedded together with instructions
explaining the use case (e.g., task and domain descriptions). Unlike encoders
from prior work that are more specialized, INSTRUCTOR is a single embedder that
can generate text embeddings tailored to different downstream tasks and
domains, without any further training. We first annotate instructions for 330
diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive
loss. We evaluate INSTRUCTOR on 70 embedding evaluation tasks (66 of which are
unseen during training), ranging from classification and information retrieval
to semantic textual similarity and text generation evaluation. INSTRUCTOR,
while having an order of magnitude fewer parameters than the previous best
model, achieves state-of-the-art performance, with an average improvement of
3.4% compared to the previous best results on the 70 diverse datasets. Our
analysis suggests that INSTRUCTOR is robust to changes in instructions, and
that instruction finetuning mitigates the challenge of training a single model
on diverse datasets. Our model, code, and data are available at
https://instructor-embedding.github.io.
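As a concrete illustration of the instruction-prefixed interface described in the abstract, the sketch below encodes one sentence under two different task instructions and compares the resulting vectors. It is a minimal sketch assuming the InstructorEmbedding package and the hkunlp/instructor-large checkpoint released with the paper; the example sentence and instruction strings are illustrative, and package and model names should be checked against the project page.

```python
# Minimal sketch: instruction-conditioned text embeddings with the released
# INSTRUCTOR checkpoint. Assumes `pip install InstructorEmbedding` (which pulls
# in sentence-transformers); names follow the project README and may change.
import numpy as np
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

sentence = "Parton energy loss in dense QCD matter"  # illustrative input text

# The same text, embedded under two different use-case instructions
# (task and domain descriptions), as in the paper's interface.
pairs = [
    ["Represent the Science title for retrieval:", sentence],
    ["Represent the Science title for clustering:", sentence],
]
embeddings = model.encode(pairs)  # numpy array of shape (2, embedding_dim)

# Cosine similarity between the two instruction-conditioned embeddings:
# the instruction shifts the representation toward the intended use case.
a, b = embeddings
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.3f}")
```

Because every input is paired with an instruction at encoding time, the same frozen model can serve classification, retrieval, semantic similarity, and other downstream uses without any further training.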
Related papers
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Efficient Pre-training for Localized Instruction Generation of Videos [32.13509517228516]
Procedural videos are instrumental in conveying step-by-step instructions.
Process Transformer (ProcX) is a model for end-to-end step localization and instruction generation for procedural videos.
arXiv Detail & Related papers (2023-11-27T16:07:37Z)
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [24.741736629886564]
Instruction tuning is a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions.
We introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset.
We show strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from a text-only instruction dataset.
arXiv Detail & Related papers (2022-12-21T05:17:06Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Composable Text Controls in Latent Space with ODEs [97.12426987887021]
This paper proposes a new efficient approach for composable text operations in the compact latent space of text.
By connecting pretrained LMs to the latent space through efficient adaptation, we decode the sampled vectors into the desired text sequences.
Experiments show that composing those operators within our approach manages to generate or edit high-quality text.
arXiv Detail & Related papers (2022-08-01T06:51:45Z)
- Leveraging Natural Supervision for Language Representation Learning and Generation [8.083109555490475]
We describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision.
We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks.
We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations.
arXiv Detail & Related papers (2022-07-21T17:26:03Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Constructing Flow Graphs from Procedural Cybersecurity Texts [16.09313316086535]
We build a large annotated procedural text dataset (CTFW) in the cybersecurity domain (3,154 documents).
We propose to identify relevant information from such texts and generate information flows between sentences.
Our experiments show that Graph Convolution Network with BERT sentence embeddings outperforms BERT in all three domains.
arXiv Detail & Related papers (2021-05-29T19:06:35Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)