General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
- URL: http://arxiv.org/abs/2004.14287v1
- Date: Wed, 29 Apr 2020 16:11:26 GMT
- Title: General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
- Authors: Jingfei Du, Myle Ott, Haoran Li, Xing Zhou, Veselin Stoyanov
- Abstract summary: We show that some of the computational cost during inference can be amortized over the different tasks using a shared text encoder.
We also compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks.
- Score: 34.47592026375839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The state of the art on many NLP tasks is currently achieved by large
pre-trained language models, which require a considerable amount of
computation. We explore a setting where many different predictions are made on
a single piece of text. In that case, some of the computational cost during
inference can be amortized over the different tasks using a shared text
encoder. We compare approaches for training such an encoder and show that
encoders pre-trained over multiple tasks generalize well to unseen tasks. We
also compare ways of extracting fixed- and limited-size representations from
this encoder, including different ways of pooling features extracted from
multiple layers or positions. Our best approach compares favorably to knowledge
distillation, achieving higher accuracy and lower computational cost once the
system is handling around 7 tasks. Further, we show that through binary
quantization, we can reduce the size of the extracted representations by a
factor of 16, making it feasible to store them for later use. The resulting
method offers a compelling solution for using large-scale pre-trained models at
a fraction of the computational cost when multiple tasks are performed on the
same text.
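The recipe the abstract describes lends itself to a short illustration. The sketch below (PyTorch; the layer choices, pooling rule, head shapes, and binarization are assumptions for illustration, not the paper's exact configuration) shows the pattern: encode each text once, pool hidden states from several layers into a fixed-size vector, binarize it for compact storage, and serve many tasks from lightweight heads on top of that stored vector.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# shared encoder -> multi-layer mean pooling -> binary quantization -> task heads.
import torch
import torch.nn as nn


class SharedEncoderPipeline(nn.Module):
    def __init__(self, encoder, tasks, layers=(-4, -3, -2, -1), hidden=768):
        super().__init__()
        self.encoder = encoder   # any module returning all per-layer hidden states
        self.layers = layers     # which layers to pool over (an assumption)
        # One small head per task; the expensive encoder is shared by all of them.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, n_classes) for name, n_classes in tasks.items()}
        )

    def embed(self, input_ids, attention_mask):
        # Run the expensive encoder once per text.
        hidden_states = self.encoder(input_ids, attention_mask)  # sequence of (B, T, H)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool tokens within each selected layer, then average across layers.
        pooled = [(hidden_states[i] * mask).sum(1) / mask.sum(1) for i in self.layers]
        return torch.stack(pooled).mean(0)                       # (B, H) fixed-size vector

    @staticmethod
    def binarize(embedding):
        # Binary quantization: keep only the sign of each dimension (1 bit per dim).
        return embedding > 0                                     # bool tensor, cheap to store

    def predict(self, embedding, task):
        # Cheap per-task inference on the shared, possibly precomputed embedding.
        return self.heads[task](embedding.float())
```

With the binarized embeddings stored, adding a task only requires training and running a small head; this is where the amortization comes from once several tasks (around seven, per the abstract) are run on the same text.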
Related papers
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
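The contrastive stage can be pictured with a generic InfoNCE-style objective in which the constructed hard negatives simply enter the softmax as extra candidates; this is a common formulation, not necessarily the paper's exact loss.

```python
# Generic InfoNCE-style contrastive loss with explicit hard negatives
# (a common formulation; assumed here for illustration).
import torch
import torch.nn.functional as F


def contrastive_loss(anchor, positive, hard_negatives, temperature=0.05):
    """anchor, positive: (B, D); hard_negatives: (B, K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negatives = F.normalize(hard_negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)           # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, hard_negatives)  # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature  # (B, 1+K)
    # The positive always sits at index 0; everything else must be pushed away.
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```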
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Efficient Controllable Multi-Task Architectures [85.76598445904374]
We propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable.
Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost.
This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures.
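A minimal sketch of the "slimmable" idea, assuming a simple width-sliced linear layer (illustrative only; the paper's architecture and width-selection policy are more involved):

```python
# A linear layer that can run at a fraction of its output width, so a
# task-specific decoder can be made cheaper or stronger under a fixed budget.
import torch
import torch.nn as nn


class SlimmableLinear(nn.Linear):
    def forward(self, x, width_ratio=1.0):
        # Use only the first `width_ratio` fraction of the output channels.
        out_features = max(1, int(self.out_features * width_ratio))
        return nn.functional.linear(x, self.weight[:out_features], self.bias[:out_features])


layer = SlimmableLinear(256, 128)
x = torch.randn(4, 256)
full = layer(x, width_ratio=1.0)    # (4, 128): full-capacity decoder
slim = layer(x, width_ratio=0.25)   # (4, 32): slimmed decoder for a cheap task
```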
arXiv Detail & Related papers (2023-08-22T19:09:56Z)
- Extracting Text Representations for Terms and Phrases in Technical Domains [9.27244202193623]
We propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices.
Models trained with this approach not only match the quality of sentence encoders in technical domains, but are also 5 times smaller and up to 10 times faster.
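A minimal sketch of this reconstruction setup, assuming a small bidirectional GRU over characters trained with a mean-squared-error objective against rows of a large pre-trained embedding matrix (sizes and the loss are illustrative assumptions):

```python
# Train a small character-level model to reproduce rows of a large
# pre-trained embedding matrix (illustrative sizes and objective).
import torch
import torch.nn as nn


class CharToEmbedding(nn.Module):
    def __init__(self, n_chars=256, char_dim=64, hidden=256, target_dim=768):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.GRU(char_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, target_dim)

    def forward(self, char_ids):                 # (B, L) character ids
        h, _ = self.encoder(self.char_emb(char_ids))
        return self.proj(h.mean(dim=1))          # (B, target_dim)


model = CharToEmbedding()
chars = torch.randint(0, 256, (8, 20))               # a batch of terms as char ids
target = torch.randn(8, 768)                         # rows of the large embedding matrix
loss = nn.functional.mse_loss(model(chars), target)  # reconstruction objective
loss.backward()
```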
arXiv Detail & Related papers (2023-05-25T08:59:36Z)
- Learning Easily Updated General Purpose Text Representations with Adaptable Task-Specific Prefixes [22.661527526471996]
Fine-tuning a large pre-trained language model for each downstream task incurs a substantial computational burden.
We propose a prefix-based method for learning fixed text representations from source tasks.
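A rough sketch of the prefix idea, under the assumption that the shared backbone and its cached text representations stay frozen while each task learns only a short prefix and a small head (the shapes and the way the prefix is injected are simplifications):

```python
# Frozen shared backbone; each task trains only a short prefix and a small head.
import torch
import torch.nn as nn


class PrefixAdapter(nn.Module):
    def __init__(self, backbone, prefix_len=8, hidden=768, n_classes=2):
        super().__init__()
        # backbone: frozen transformer body mapping (B, T, H) -> (B, T, H)
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden) * 0.02)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_embeddings):                # (B, T, H) cached token features
        batch = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        h = self.backbone(torch.cat([prefix, token_embeddings], dim=1))  # (B, P+T, H)
        return self.head(h[:, 0])                       # classify from the first position
```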
arXiv Detail & Related papers (2023-05-22T21:31:03Z)
- Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging because a large number of weights must be fine-tuned and, as a result, the encoders forget information about previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
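A minimal sketch of such a bottleneck, assuming a single codebook and hard nearest-key lookup; this is a simplification of the proposed architecture, not the authors' implementation.

```python
# Each encoder feature is snapped to its nearest key; the model reads only the
# learnable value attached to that key, so adaptation touches the values,
# not the encoder (single codebook assumed for illustration).
import torch
import torch.nn as nn


class DiscreteKeyValueBottleneck(nn.Module):
    def __init__(self, n_codes=512, key_dim=768, value_dim=768):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_codes, key_dim))      # typically kept frozen
        self.values = nn.Parameter(torch.zeros(n_codes, value_dim))  # updated per task

    def forward(self, features):                                     # (B, key_dim)
        # Hard nearest-key lookup (non-differentiable selection).
        idx = torch.cdist(features, self.keys).argmin(dim=-1)        # (B,)
        return self.values[idx]                                      # (B, value_dim)
```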
arXiv Detail & Related papers (2022-07-22T17:52:30Z)
- Grad2Task: Improved Few-shot Text Classification Using Gradients for Task Representation [24.488427641442694]
We propose a novel conditional neural process-based approach for few-shot text classification.
Our key idea is to represent each task using gradient information from a base model.
Our approach outperforms traditional fine-tuning, sequential transfer learning, and state-of-the-art meta learning approaches.
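The key idea can be sketched in a few lines: take gradients of a base model's loss on a task's support set and flatten them into a task vector (the base model, loss, and flattening scheme here are illustrative assumptions):

```python
# Represent a task by the gradients of a base model on its support set.
import torch
import torch.nn as nn


def task_representation(base_model, support_x, support_y):
    loss = nn.functional.cross_entropy(base_model(support_x), support_y)
    grads = torch.autograd.grad(loss, list(base_model.parameters()))
    # Flatten and concatenate per-parameter gradients into one task vector.
    return torch.cat([g.flatten() for g in grads]).detach()


base = nn.Linear(32, 4)                       # stand-in for the base text classifier
x, y = torch.randn(16, 32), torch.randint(0, 4, (16,))
task_vec = task_representation(base, x, y)    # fed to a task-conditioning module
```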
arXiv Detail & Related papers (2022-01-27T15:29:30Z)
- Fine-grained Multi-Modal Self-Supervised Learning [4.850800439026724]
Multi-Modal Self-Supervised Learning from videos has been shown to improve a model's performance on various downstream tasks.
Such pre-training requires large batch sizes and a large amount of computation resources due to the noise present in uncurated data.
We propose a fine-grained multi-modal self-supervised training scheme that computes the similarity between embeddings at a finer scale.
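One way to picture "finer scale" similarity is to score individual frame/token embeddings against each other and aggregate, rather than comparing one pooled vector per clip; the aggregation below (max over text positions, then mean) is an assumed choice for illustration.

```python
# Token/frame-level similarity for one video-text pair (illustrative aggregation).
import torch
import torch.nn.functional as F


def fine_grained_similarity(video_tokens, text_tokens):
    """video_tokens: (Tv, D); text_tokens: (Tt, D)."""
    v = F.normalize(video_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    sim = v @ t.T                         # (Tv, Tt) token-level similarities
    return sim.max(dim=1).values.mean()   # best text match per video token, averaged
```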
arXiv Detail & Related papers (2021-12-22T19:17:45Z)
- Efficient Retrieval Optimized Multi-task Learning [16.189136169520424]
We propose a novel Retrieval Optimized Multi-task (ROM) framework for jointly training self-supervised tasks, knowledge retrieval, and extractive question answering.
Our ROM approach presents a unified and generalizable framework that enables scaling efficiently to multiple tasks.
Using our framework, we achieve performance comparable to or better than recent methods on QA, while drastically reducing the number of parameters.
arXiv Detail & Related papers (2021-04-20T17:16:34Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from the network's computation for simpler instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
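A minimal sketch of confidence-based early exiting, assuming a small classifier attached to every layer and a fixed confidence threshold (the paper's calibration and training details are omitted):

```python
# Stop computing as soon as an intermediate classifier is confident enough.
import torch
import torch.nn as nn


def early_exit_predict(layers, exit_heads, x, threshold=0.9):
    """layers, exit_heads: lists of modules of equal length; x: (1, H)."""
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = layer(h)
        probs = torch.softmax(head(h), dim=-1)
        confidence, prediction = probs.max(dim=-1)
        if confidence.item() >= threshold:   # confident enough: exit early
            return prediction, depth
    return prediction, depth                 # fell through to the final layer
```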
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
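A rough, first-order sketch of that connection: each meta-train step adapts a copy of the model on one task, and the outer update favors representations that adapt quickly. Everything here, including the first-order approximation, is a simplification of the general MAML-style view, not the paper's algorithm.

```python
# First-order meta-training over a set of tasks (simplified sketch).
import copy
import torch


def meta_train_step(model, tasks, inner_lr=1e-2, outer_lr=1e-3):
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for loss_fn in tasks:                        # each task exposes a loss on its own batch
        fast = copy.deepcopy(model)
        # Inner step: adapt the copy on this task.
        loss_fn(fast).backward()
        with torch.no_grad():
            for p in fast.parameters():
                p -= inner_lr * p.grad
        fast.zero_grad()
        # Outer step (first-order): evaluate after adaptation, accumulate gradients.
        loss_fn(fast).backward()
        for g, p in zip(meta_grads, fast.parameters()):
            g += p.grad
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g / len(tasks)
```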
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.