Empirical Evaluation of Pre-trained Transformers for Human-Level NLP:
The Role of Sample Size and Dimensionality
- URL: http://arxiv.org/abs/2105.03484v1
- Date: Fri, 7 May 2021 20:06:24 GMT
- Title: Empirical Evaluation of Pre-trained Transformers for Human-Level NLP:
The Role of Sample Size and Dimensionality
- Authors: Adithya V Ganesan, Matthew Matero, Aravind Reddy Ravula, Huy Vu and H. Andrew Schwartz
- Abstract summary: RoBERTa consistently achieves top performance in human-level tasks, with PCA giving a benefit over other reduction methods by better handling users who write longer texts.
A majority of the tasks achieve results comparable to the best performance with just $\frac{1}{12}$ of the embedding dimensions.
- Score: 6.540382797747107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In human-level NLP tasks, such as predicting mental health, personality, or
demographics, the number of observations is often smaller than the standard
768+ hidden state sizes of each layer within modern transformer-based language
models, limiting the ability to effectively leverage transformers. Here, we
provide a systematic study on the role of dimension reduction methods
(principal components analysis, factorization techniques, or multi-layer
auto-encoders), as well as how predictive performance varies with embedding
dimensionality and sample size. We first find that fine-tuning large models
with a limited amount of data poses a significant difficulty which can be
overcome with a pre-trained dimension reduction regime. RoBERTa
consistently achieves top performance in human-level tasks, with PCA giving a
benefit over other reduction methods by better handling users who write longer
texts. Finally, we observe that a majority of the tasks achieve results
comparable to the best performance with just $\frac{1}{12}$ of the embedding
dimensions.
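The pipeline the abstract describes can be sketched in a few lines: extract frozen RoBERTa hidden states, reduce them to roughly 1/12 of the 768 dimensions with PCA, and fit a small predictor on the reduced features. The snippet below is an illustrative sketch, not the authors' released code; "roberta-base", mean pooling over tokens, the toy texts/labels, and the Ridge predictor are all assumptions.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
encoder.eval()

def embed(texts):
    """Mean-pooled last-layer hidden states (768-d) for each text."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state           # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, 768)
    return pooled.numpy()

# Placeholder data: in the paper's setting this would be a few hundred users
# with a human-level outcome (e.g., a personality or mental-health score).
texts = ["I had a rough week but things are looking up.",
         "Loving the new job so far!"]
labels = np.array([3.1, 4.5])

X = embed(texts)
pca = PCA(n_components=min(64, len(texts)))   # 64 ~ 768/12 once enough users are available
X_reduced = pca.fit_transform(X)
predictor = Ridge(alpha=1.0).fit(X_reduced, labels)  # small linear model on reduced features
```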
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
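A minimal sketch of the instance-level idea described above, under the assumption that "uninformative" examples are those with near-zero loss; this is a generic loss-based weighting scheme, not the paper's specific algorithm.

```python
import torch
import torch.nn.functional as F

def reweighted_loss(logits, targets, temperature=1.0):
    """Weight each example by a softmax over its own loss, so low-loss
    (likely redundant) examples contribute less to the update."""
    per_example = F.cross_entropy(logits, targets, reduction="none")    # (B,)
    weights = torch.softmax(per_example.detach() / temperature, dim=0)  # (B,), sums to 1
    return (weights * per_example).sum()

# Toy usage: batch of 4 examples, 10-class problem.
logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))
loss = reweighted_loss(logits, targets)
loss.backward()
```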
arXiv Detail & Related papers (2025-02-10T17:57:15Z) - Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings [28.35953315232521]
Sentence embeddings produced by Pretrained Language Models (PLMs) have received wide attention from the NLP community.
The high dimensionality of the sentence embeddings produced by PLMs is problematic when representing large numbers of sentences on memory- or compute-constrained devices.
We evaluate unsupervised dimensionality reduction methods to reduce the dimensionality of sentence embeddings produced by PLMs.
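As a rough illustration of what such an evaluation involves (with random vectors standing in for real PLM sentence embeddings, and similarity preservation as a stand-in for the paper's actual evaluation protocol), one can compare reducers by how well they preserve pairwise cosine similarities:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))          # stand-in for 768-d sentence embeddings

orig_sims = cosine_similarity(X)[np.triu_indices(500, k=1)]
for name, reducer in [("PCA", PCA(n_components=128)),
                      ("TruncatedSVD", TruncatedSVD(n_components=128))]:
    Z = reducer.fit_transform(X)
    red_sims = cosine_similarity(Z)[np.triu_indices(500, k=1)]
    rho = spearmanr(orig_sims, red_sims).correlation
    print(f"{name}: Spearman correlation of pairwise similarities = {rho:.3f}")
```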
arXiv Detail & Related papers (2024-03-20T21:58:32Z)
- On the Dimensionality of Sentence Embeddings [56.86742006079451]
We show that the optimal dimension of sentence embeddings is usually smaller than the default value.
We propose a two-step training method for sentence representation learning models, wherein the encoder and the pooler are optimized separately to mitigate the overall performance loss.
arXiv Detail & Related papers (2023-10-23T18:51:00Z)
- Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models, which have millions of parameters, are first pre-trained on a large corpus and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
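A minimal sketch of one common recipe for shrinking such a model for constrained hardware, post-training dynamic quantization in PyTorch; this is illustrative only, and the choice of "bert-base-uncased" is an assumption, while the paper itself benchmarks specific quantized implementations on actual edge devices.

```python
import io
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8   # quantize all Linear layers to int8
)

def size_mb(m):
    """Serialized size of a model's state_dict, in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.1f} MB -> int8 (dynamic): {size_mb(quantized):.1f} MB")
```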
arXiv Detail & Related papers (2023-10-06T01:59:19Z)
- Enhancing Representation Learning on High-Dimensional, Small-Size Tabular Data: A Divide and Conquer Method with Ensembled VAEs [7.923088041693465]
We present an ensemble of lightweight VAEs to learn posteriors over subsets of the feature-space, which get aggregated into a joint posterior in a novel divide-and-conquer approach.
We show that our approach is robust to partial features at inference, exhibiting little performance degradation even with most features missing.
arXiv Detail & Related papers (2023-06-27T17:55:31Z)
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance [114.1541203743303]
We propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation.
We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification.
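The UCB idea can be sketched generically: track a smoothed importance estimate per weight together with its uncertainty, and rank weights by the upper bound. The snippet below is a rough sketch of that idea only; the paper's actual estimator, smoothing, and pruning schedule differ.

```python
import torch

class UCBImportance:
    """Track importance = |w * grad| with EMAs of its mean and deviation (a generic UCB-style score)."""
    def __init__(self, param, beta1=0.85, beta2=0.95):
        self.beta1, self.beta2 = beta1, beta2
        self.mean = torch.zeros_like(param)     # smoothed importance estimate
        self.uncert = torch.zeros_like(param)   # smoothed |deviation| (uncertainty)

    def update(self, param):
        score = (param.detach() * param.grad).abs()
        self.uncert = self.beta2 * self.uncert + (1 - self.beta2) * (score - self.mean).abs()
        self.mean = self.beta1 * self.mean + (1 - self.beta1) * score

    def upper_bound(self):
        return self.mean + self.uncert          # keep weights with a high upper bound

# Toy usage: score a 4x4 parameter after one backward pass, then keep the top half.
p = torch.nn.Parameter(torch.randn(4, 4))
(p ** 2).sum().backward()
tracker = UCBImportance(p)
tracker.update(p)
keep_mask = tracker.upper_bound() > tracker.upper_bound().median()
```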
arXiv Detail & Related papers (2022-06-25T05:38:39Z)
- Exploring Dimensionality Reduction Techniques in Multilingual Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensionality reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers.
It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
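A simplified sketch of the conditional-computation idea: the FFN's hidden units are split into contiguous slices and a small router picks the top-k slices per token. MoEfication itself derives the experts by clustering the neurons of a pretrained FFN, so the contiguous split and the learned router here are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEfiedFFN(nn.Module):
    """Run only the top-k expert slices of an FFN per token (inference-style routing)."""
    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear, num_experts=8, top_k=2):
        super().__init__()
        d_model, d_hidden = ffn_in.in_features, ffn_in.out_features
        self.num_experts, self.top_k = num_experts, top_k
        self.chunk = d_hidden // num_experts
        # Reuse the original FFN weights; experts are contiguous slices of the hidden units.
        self.w_in = nn.Parameter(ffn_in.weight.detach().clone())    # (d_hidden, d_model)
        self.b_in = nn.Parameter(ffn_in.bias.detach().clone())      # (d_hidden,)
        self.w_out = nn.Parameter(ffn_out.weight.detach().clone())  # (d_model, d_hidden)
        self.b_out = nn.Parameter(ffn_out.bias.detach().clone())    # (d_model,)
        self.router = nn.Linear(d_model, num_experts)               # small gating network

    def forward(self, x):                                            # x: (tokens, d_model)
        top = self.router(x).topk(self.top_k, dim=-1).indices        # (tokens, top_k)
        out = x.new_zeros(x.shape[0], self.b_out.shape[0]) + self.b_out
        for e in range(self.num_experts):
            rows = (top == e).any(dim=-1)                             # tokens routed to expert e
            if not rows.any():
                continue
            sl = slice(e * self.chunk, (e + 1) * self.chunk)
            h = F.relu(x[rows] @ self.w_in[sl].T + self.b_in[sl])     # only this expert's hidden units
            out[rows] = out[rows] + h @ self.w_out[:, sl].T
        return out

# Toy usage: MoE-fy a random 512 -> 2048 -> 512 FFN and run 4 token vectors through it.
moe = MoEfiedFFN(nn.Linear(512, 2048), nn.Linear(2048, 512))
y = moe(torch.randn(4, 512))   # (4, 512)
```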
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.