What do Toothbrushes do in the Kitchen? How Transformers Think our World
is Structured
- URL: http://arxiv.org/abs/2204.05673v1
- Date: Tue, 12 Apr 2022 10:00:20 GMT
- Title: What do Toothbrushes do in the Kitchen? How Transformers Think our World
is Structured
- Authors: Alexander Henlein, Alexander Mehler
- Abstract summary: We investigate what extent transformer-based language models allow for extracting knowledge about object relations.
We show that the models combined with the different similarity measures differ greatly in terms of the amount of knowledge they allow for extracting.
Surprisingly, static models perform almost as well as contextualized models -- in some cases even better.
- Score: 137.83584233680116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models are now predominant in NLP. They outperform
approaches based on static models in many respects. This success has in turn
prompted research that reveals a number of biases in the language models
generated by transformers. In this paper we utilize this research on biases to
investigate to what extent transformer-based language models allow for
extracting knowledge about object relations (X occurs in Y; X consists of Z;
action A involves using X). To this end, we compare contextualized models with
their static counterparts. We make this comparison dependent on the application
of a number of similarity measures and classifiers. Our results are threefold:
Firstly, we show that the models combined with the different similarity
measures differ greatly in terms of the amount of knowledge they allow for
extracting. Secondly, our results suggest that similarity measures perform much
worse than classifier-based approaches. Thirdly, we show that, surprisingly,
static models perform almost as well as contextualized models -- in some cases
even better.
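To make the contrast concrete, below is a minimal sketch of the two extraction strategies the abstract compares: scoring object-relation pairs with a similarity measure versus training a classifier on pair embeddings. The vectors, vocabulary, and labeled pairs are invented placeholders; in the paper the embeddings would come from static (e.g. word2vec) or contextualized (e.g. BERT) models.

```python
# Sketch of the two knowledge-extraction strategies contrasted above.
# Embeddings here are random placeholders, not real model vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 300
vocab = ["toothbrush", "pan", "kitchen", "bathroom"]
emb = {w: rng.normal(size=dim) for w in vocab}  # placeholder vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Strategy 1: similarity measure -- rank candidate locations directly.
for loc in ["kitchen", "bathroom"]:
    print("toothbrush ~", loc, cosine(emb["toothbrush"], emb[loc]))

# Strategy 2: classifier -- learn "X occurs in Y" from labeled pairs.
pairs = [("pan", "kitchen", 1), ("toothbrush", "bathroom", 1),
         ("toothbrush", "kitchen", 0), ("pan", "bathroom", 0)]
X = np.stack([np.concatenate([emb[a], emb[b]]) for a, b, _ in pairs])
y = [label for *_, label in pairs]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X)[:, 1])  # P(relation holds) per pair
```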
Related papers
- Exploring the Learning Capabilities of Language Models using LEVERWORLDS [23.40759867281453]
Learning a model of a setting often involves learning both general structure rules and specific properties of the instance.
This paper investigates the interplay between learning the general and the specific in various learning methods, with emphasis on sample efficiency.
arXiv Detail & Related papers (2024-10-01T09:02:13Z)
- Approximate Attributions for Off-the-Shelf Siamese Transformers [2.1163800956183776]
Siamese encoders such as sentence transformers are among the least understood deep models.
We propose a model with exact attribution ability that retains the original model's predictive performance.
We also propose a way to compute approximate attributions for off-the-shelf models.
arXiv Detail & Related papers (2024-02-05T10:49:05Z)
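The summary does not state the attribution method, so the sketch below uses generic gradient-times-input attribution on a toy Siamese cosine-similarity score. This illustrates attribution for Siamese models in general, not the paper's exact procedure; the encoder and inputs are placeholders.

```python
# Generic illustration (not the paper's method): attribute a Siamese
# cosine-similarity score to input features via gradient x input.
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(16, 8)          # stand-in for a sentence encoder
a = torch.randn(16, requires_grad=True)   # placeholder "sentence" features
b = torch.randn(16)

score = torch.nn.functional.cosine_similarity(
    encoder(a), encoder(b), dim=0)
score.backward()
attribution = (a.grad * a).detach()       # per-feature contribution estimate
print(score.item(), attribution[:4])
```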
- Perturbed examples reveal invariances shared by language models [8.04604449335578]
We introduce a novel framework to compare two NLP models.
Via experiments on models from the same and different architecture families, this framework offers insights about how changes in models affect linguistic capabilities.
arXiv Detail & Related papers (2023-11-07T17:48:35Z)
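A minimal sketch of one way such a comparison can work, assuming the idea in the title: perturb inputs and measure how often two models agree on which perturbations leave their predictions unchanged. The perturbation and the toy "models" below are invented for illustration only.

```python
# Sketch: two models share an invariance if perturbations that leave one
# model's prediction unchanged also leave the other's unchanged.
import random

def perturb(text):
    # Placeholder perturbation: drop one random token.
    toks = text.split()
    toks.pop(random.randrange(len(toks)))
    return " ".join(toks)

def invariance_agreement(model_a, model_b, texts, n_perturb=20):
    agree = total = 0
    for t in texts:
        for _ in range(n_perturb):
            p = perturb(t)
            inv_a = model_a(p) == model_a(t)   # did A's label survive?
            inv_b = model_b(p) == model_b(t)
            agree += inv_a == inv_b
            total += 1
    return agree / total

# Toy "models": label by length parity vs. presence of a keyword.
random.seed(0)
m1 = lambda s: len(s.split()) % 2
m2 = lambda s: int("kitchen" in s)
print(invariance_agreement(m1, m2, ["the toothbrush is in the kitchen"]))
```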
- Classical Sequence Match is a Competitive Few-Shot One-Class Learner [15.598750267663286]
We investigate the few-shot one-class problem, which takes a known sample as a reference to detect whether an unknown instance belongs to the same class.
It is shown that, with meta-learning, the classical sequence match method, i.e. Compare-Aggregate, significantly outperforms transformer-based ones.
arXiv Detail & Related papers (2022-09-14T03:21:47Z)
- Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers that is used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
arXiv Detail & Related papers (2022-01-15T06:21:01Z)
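A minimal sketch of the two-component MHS layout described above: a shared stack of transformer layers feeding a set of ranking heads, one per ensemble teacher. Dimensions, pooling, and the distillation loss are simplified assumptions, not the paper's configuration.

```python
# Sketch of the described layout: shared transformer encoder, several
# ranking heads, each distillable from a different ensemble member.
import torch
import torch.nn as nn

class MultiHeadStudent(nn.Module):
    def __init__(self, d_model=64, n_layers=2, n_rank_heads=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One ranking head per teacher in the ensemble.
        self.rank_heads = nn.ModuleList(
            nn.Linear(d_model, 1) for _ in range(n_rank_heads))

    def forward(self, x):                  # x: (batch, seq, d_model)
        h = self.encoder(x)[:, 0]          # first-token pooling
        return torch.cat([head(h) for head in self.rank_heads], dim=-1)

model = MultiHeadStudent()
scores = model(torch.randn(2, 10, 64))    # one score per ranking head
print(scores.shape)                       # torch.Size([2, 3])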
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to have outrageously large amounts of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
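A minimal sketch of the stochastic-expert idea as summarized above: instead of a learned router (as in the Switch Transformer), an expert is chosen uniformly at random for each input, at training and inference time alike. Layer sizes and the expert networks are placeholders.

```python
# Sketch: random expert selection per input, no learned routing.
import torch
import torch.nn as nn

class StochasticExperts(nn.Module):
    def __init__(self, d_model=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts))

    def forward(self, x):                     # x: (batch, d_model)
        # Random expert per input, used at training AND inference time.
        idx = torch.randint(len(self.experts), (x.size(0),))
        return torch.stack([self.experts[i](row)
                            for i, row in zip(idx.tolist(), x)])

layer = StochasticExperts()
print(layer(torch.randn(8, 32)).shape)        # torch.Size([8, 32])
```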
- A Neural Few-Shot Text Classification Reality Check [4.689945062721168]
Several neural few-shot classification models have emerged, yielding significant progress over time.
In this paper, we compare all these models, first adapting those developed for image processing to NLP, and second providing them access to transformers.
We then test these models equipped with the same transformer-based encoder on the intent detection task, known for having a large number of classes.
arXiv Detail & Related papers (2021-01-28T15:46:14Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- On the Discrepancy between Density Estimation and Sequence Generation [92.70116082182076]
Log-likelihood is highly correlated with BLEU when we consider models within the same family.
We observe no correlation between rankings of models across different families.
arXiv Detail & Related papers (2020-02-17T20:13:35Z)
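One standard way to quantify such a claim is a rank correlation between per-model log-likelihood and BLEU. The sketch below uses scipy's Spearman correlation on invented placeholder numbers, not the paper's results.

```python
# Sketch of the analysis type: Spearman rank correlation between model
# log-likelihoods and BLEU scores. All numbers are placeholders.
from scipy.stats import spearmanr

log_likelihoods = [-2.1, -1.9, -1.7, -1.5]   # models within one family
bleu_scores     = [24.0, 25.3, 26.1, 27.8]
rho, p = spearmanr(log_likelihoods, bleu_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```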
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
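As one plausible reading of the summary above, a diversity score can be derived from the batch-averaged output distribution: tokens that the average assigns high probability to count as dull, and responses avoiding them score higher. Everything below (distributions, token ids) is a placeholder, not the paper's exact formulation.

```python
# Sketch of an AvgOut-style diversity score over output distributions.
import torch

torch.manual_seed(0)
vocab_size = 100
# Placeholder: output distributions for a batch of decoding steps.
batch_dists = torch.softmax(torch.randn(32, vocab_size), dim=-1)
avg_dist = batch_dists.mean(dim=0)            # the averaged distribution

def diversity_score(token_ids):
    # High score -> response avoids globally frequent (dull) tokens.
    return float((1.0 - avg_dist[token_ids]).mean())

response = torch.tensor([3, 17, 42])          # placeholder token ids
print(diversity_score(response))
```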
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.