Transformer-based language models (LMs) continue to achieve state-of-the-art
performance on natural language processing (NLP) benchmarks, including tasks
designed to mimic human-inspired "commonsense" competencies. To better
understand the degree to which LMs can be said to have certain linguistic
reasoning skills, researchers are beginning to adapt the tools and concepts
from psychometrics. But to what extent can benefits flow in the other
direction? In other words, can LMs be of use in predicting the psychometric
properties of test items, when those items are given to human participants? If
so, the benefit for psychometric practitioners is enormous, as it can reduce
the need for multiple rounds of empirical testing. We gather responses from
numerous human participants and LMs (transformer- and non-transformer-based) on
a broad diagnostic test of linguistic competencies. We then calculate standard
psychometric properties of the items in the diagnostic test, using the human
responses and the LM responses separately. We
then determine how well these two sets of predictions correlate. We find that
transformer-based LMs predict the human psychometric data consistently well
across most categories, suggesting that they can be used to gather human-like
psychometric data without the need for extensive human trials.
Keywords: classical test theory, item response theory, natural language processing
1 Introduction

The current generation of transformer-based language models (TLMs) (Vaswani et al, 2017) continues to surpass expectations, consistently achieving state-of-the-art results on many natural language processing (NLP) tasks.
Transformers are a type of artificial neural network that connects text encoders and decoders without the recurrent links used in previous architectures such as Long Short Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997).
Instead, they rely on a computationally efficient self-attention mechanism (Vaswani et al, 2017).
Especially surprising is the remarkable performance of these models on benchmark tasks designed to assess “commonsense” reasoning (e.g., Wang et al, 2018, 2019), possibly owing to their ability to encode and retrieve a surprising amount of structural knowledge (Goldberg, 2019; Hu et al, 2020; Cui et al, 2020).
Understanding how TLMs reason is a complex task made more difficult by the fact that the sizes of contemporary TLMs are so large that they are effectively black boxes.
As such, researchers are continually searching for new methods to understand the strengths and limitations of TLMs.
One promising approach is to draw from the tools of psychometrics, which allows us to measure latent attributes like reasoning skills, even if the mechanisms giving rise to these attributes are not well understood.
Although some have called for bridging the gap between psychometrics and artificial intelligence (AI) (Bringsjord, 2011; Bringsjord and Licato, 2012; Hernández-Orallo et al, 2016; Wilcox et al, 2020), the amount of work attempting to do so has been limited.
While methods from psychometrics could certainly be useful as a diagnostic tool for AI practitioners, the remarkable performance of TLMs on reasoning tasks suggests that they might also be useful to psychometricians when designing evaluation scales.
Most prior work has focused on the benefits psychometrics can bring to AI, however, and has not considered whether tools from AI can also benefit psychometrics, which is the focus of the present paper.
To illustrate how AI might be applied to psychometrics, assume that someone wishes to design a test to assess the degree to which a person possesses mastery of some cognitive skill S. A good place to start is for a panel of experts to design a set of test items I, such that they believe solving I requires S, and can therefore be used to measure mastery of S. A common task in psychometrics is to design measurement tools such as I, and then to apply I to a large number of human participants.
The data obtained from these trials can be used to estimate psychometric properties of the items in I, such as their reliability, validity, and fairness.
But establishing these properties can be prohibitively costly, requiring large numbers of human participants to answer the items in I and iteratively refine them.
In this work, we explore whether responses gathered from LMs can instead be used to estimate these properties. To do this, we identified a subset of items from the General Language Understanding Evaluation (GLUE) broad coverage diagnostic (Wang et al, 2018), a challenging benchmark of linguistic reasoning skills used to measure the progress of language modeling in the NLP community.
We collected human responses on these items to assess simple psychometric properties, designing a novel user validation procedure to do so.
We then assess the performance of 240 language models (LMs) on these diagnostic items.
Our resulting analysis suggests TLMs show promise in modeling human psychometric properties in certain sub-categories of linguistic skills, thus providing fruitful directions for future work.
2 Background in Natural Language Processing
As our work draws heavily on models, datasets, and techniques from NLP, we will begin by briefly introducing some important concepts that will be used throughout this work.
Note that this is not meant to be an exhaustive introduction to the field; the interested reader is encouraged to refer to the citations throughout this section for more details.
In NLP, language models (LMs) are the primary tool used to perform tasks related to natural language understanding (e.g., sentiment analysis, machine translation, and so forth).
All the models used throughout this work are examples of LMs.
Given a sequence of words, the task of an LM is to predict which word is most likely to come next:
P(wt | w1:t−1)    (1)

where wt is the word to be predicted by the LM at timestep t, and w1:t−1 is the prior t − 1 words given to the LM to be used to make said prediction (Jurafsky, 2000).
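To make the prediction task in Equation (1) concrete, the sketch below queries an off-the-shelf causal LM for its distribution over the next word. The Hugging Face transformers library and the public gpt2 checkpoint are assumptions made purely for illustration; they are not the models evaluated in this paper.

```python
# Hedged sketch: estimate P(w_t | w_1:t-1) with a pretrained causal LM.
# Assumes the Hugging Face "transformers" library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The dog sat in the"                      # w_1:t-1
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)        # distribution over the next word w_t
top = torch.topk(probs, k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {score.item():.3f}")
```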
Fig. 1. A simple ANN for the task of sentiment analysis. Words are input to the hidden layers, which learn to map an arbitrary sequence to a fixed output space (positive or negative sentiment). Note that the input layer is typically not counted when listing the total number of layers.
An LM can be constructed using a variety of probabilistic models; however, the one most relevant to this work is the artificial neural network (ANN).
At a high level, ANNs operate by taking in a vector representation at the input layer, performing a series of transformations on the input in each hidden layer, and finally mapping the hidden layer to a fixed-length representation in the output layer.
Figure 1 shows a schematic representation of a simple 2-layer ANN.
As the hidden layers within an ANN can perform a variety of non-linear transformations to the input, ANNs are quite expressive in the kinds of representations they can learn (Yarotsky, 2021), which makes them highly effective as models of language.

Fig. 2. The architecture of the transformer. The input sequence is given with a mask token <mask> and the correct word is given as output. The model predicts the probability distribution for the mask token and changes its weights after comparing its predictions with the actual next word. During pre-training, this process is repeated many times over a large number of sentences and documents.

The neural language model was first introduced in Bengio et al (2003), and works by using ANNs to approximate the probability of each word, given the prior sequence of words.
Since the advent of neural language modeling, more sophisticated neural networks have been employed in NLP, including Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks, and transformers (Vaswani et al, 2017).
These types of neural networks perform the same basic set of operations as the vanilla ANN, but they differ in how their architectures are designed.
LSTMs rely on using recurrent (cyclic) links between hidden layers, which allows information from previous hidden layers to affect the representations learned in later layers.
Transformers instead rely on attention. Attention works by masking out less relevant portions of the input, such that they contribute less information to later layers.
For example, given the sentence “The dog sat in the chair.”, attention would learn that “sat” and “chair” contribute more to the meaning of “dog” in this sentence than words like “the.”
As discussed earlier, while attention has been employed in previous types of neural networks, transformers are unique in that they use only attention to learn representations of the input, throwing out recurrent layers entirely.
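For readers unfamiliar with the operation, the following numpy sketch shows the scaled dot-product self-attention computation that transformer blocks are built around; the sentence length and dimensions are arbitrary placeholders.

```python
# Minimal numpy sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token vectors; Wq/Wk/Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # pairwise relevance of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over the keys
    return weights @ V                                      # each token mixes the values it attends to

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 7, 16, 8                            # e.g., "The dog sat in the chair ."
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                  # (7, 8)
```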
Figure 2 shows the general structure of a transformer.
A transformer block consists of only an attention operation, followed by a standard hidden layer from a typical ANN.
Despite their simplicity, transformers have proven to be highly versatile models, and have surpassed the performance of previous successful architectures on virtually every NLP task.
While there are a variety of approaches to training an LM, by far the most successful of them was pioneered by Devlin et al (2018) who introduced the BERT (Bidirectional Encoder Representations from Transformers) LM.
BERT is a transformer that is trained in two stages, the first being pre-training where the model is trained using a self-supervised language modeling objective over a large corpus of text.
In the second finetuning stage the model is further trained on a labeled dataset for a particular task, thus allowing the same pre-trained model to be used for many different tasks.
One can think of the pre-training stage as giving the model a large amount of domain-general knowledge, whereas the finetuning stage focuses on how to use that knowledge to solve a specific task.
A task central to this work is natural language inference (NLI), in which a model is given a premise sentence p and a hypothesis sentence h and must determine the relationship between them. There are typically three choices: either p does textually entail h (entailment), p entails that h is impossible (contradiction), or h’s truth cannot be determined from p alone (neutral).
Whether p entails or does not entail h can depend on many factors, such as the syntactic relationships between the sentences, the information that the sentences convey, or some external knowledge about the world.
For example, consider an NLI question with p = “My dog needs to be walked.” and h = “My dogs need to be walked.”
We would say that h contradicts p because it was established in p that I have only one dog.
As another example, consider p = “The BART line I always take was delayed.” and h = “I’m going to miss my tour of the Statue of Liberty.”
We might say that this is a contradiction because the BART operates in San Francisco and not New York City.
However, we might also say that p is neutral with respect to h (perhaps I need to ride the BART to the airport, where I will then fly to New York City).
Regardless, this demonstrates how the NLI task can also incorporate external information not explicitly stated in either sentence.
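An NLI item like the ones above can be posed to a finetuned LM as a sentence-pair classification problem. The sketch below assumes the Hugging Face transformers library and the public roberta-large-mnli checkpoint, which is not one of the specific models trained for this study; the label order is read from the model configuration rather than hard-coded.

```python
# Hedged sketch: score an NLI item (premise p, hypothesis h) with an MNLI-finetuned LM.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"                         # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

p = "My dog needs to be walked."
h = "My dogs need to be walked."
inputs = tokenizer(p, h, return_tensors="pt")       # premise and hypothesis as a sentence pair
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)
for idx, label in model.config.id2label.items():    # e.g., contradiction / neutral / entailment
    print(f"{label}: {probs[idx].item():.3f}")
```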
The NLI task was formalized in the PASCAL recognizing textual entailment tasks (Dagan et al, 2006), which were a series of workshops designed to spur the development of NLP systems for inferential reasoning.
The NLI datasets developed for these tasks were quite small, having only a few thousand items in total, which made it very difficult to train deep neural networks on them.
The Stanford natural language inference (SNLI) (Bowman et al, 2015) corpus was the first large-scale dataset of NLI questions, having around 570,000 items in total, which made it practical to train LMs for NLI.
Since the release of SNLI, other large-scale NLI datasets have been curated, including MultiNLI (MNLI) (Williams et al, 2018) and Adversarial NLI (ANLI) (Nie et al, 2020), each of which curates NLI questions of varying levels of difficulty and covers different domains of text (fictional stories, news, telephone conversations, etc.).
This has made the NLI task quite general in the kinds of reasoning it can test for, while also being straightforward to administer to both humans and LMs, which makes the task ideal for the present study.
More recently, there has been a trend toward developing more comprehensive assessments of LM performance, meant to mimic the diverse skill sets a model would need to master when operating in the real world.
The General Language Understanding Evaluation (GLUE), as well as its more recent extension SuperGLUE (Wang et al, 2018, 2019), are such benchmarks and are meant to assess a broad set of linguistic reasoning competencies.
GLUE was curated by combining previous datasets into a single benchmark task, covering a diverse set of underlying skills, including NLI, question answering, paraphrase detection, and others.
As there has been rapid progress in NLP in recent years, the authors of GLUE found that the benchmark quickly lost the ability to discriminate between high and low-performance LMs on the tasks it covered.
GLUE also includes a broad coverage diagnostic, which is the test we use in this work. The diagnostic covers four main categories of linguistic competencies: lexical semantics, predicate-argument structure, logic, and knowledge and common sense.
These categories are further divided into multiple sub-categories, each of which covers a specific and interesting phenomenon in language.
The broad coverage diagnostic was manually curated by linguistics and NLP experts and is meant to assess broad psycholinguistic competencies of LMs across multiple categories.
For instance, the propositional structure category contains questions that exploit propositional logic operators, e.g., p = “The cat sat on the mat.” and h = “The cat did not sit on the mat.”
The diagnostic thus aims to be a comprehensive test of linguistic reasoning skills, making it suitable for our present study.
As discussed in Section 3, we use only the following seven sub-categories from the diagnostic for our experiments:
1. morphological negation: Covers questions that require reasoning over negation in either its logical or psycholinguistic form.
2. prepositional phrases: Tests for the ability to handle ambiguity introduced by the insertion or removal of prepositions (e.g., p = “Cape sparrows eat seeds, along with soft plant parts and insects.” and h = “Soft plant parts and insects eat seeds.”).
From this subset, we had 811 diagnostic questions encompassing 20 sub-categories.
Each sub-category had at least 15 questions, and we selected the seven sub-categories enumerated in Section 2.3 to use in our experiments.
We selected these 7 sub-categories based on how much the average performance of the LMs improved after pre-training and finetuning.
A substantial performance improvement indicated the category was solvable by the models, and would therefore provide a meaningful comparison to the human data.
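The selection rule just described can be expressed as a simple filter over per-sub-category accuracies. The column names, numbers, and the improvement threshold in this sketch are placeholders, not values taken from the paper.

```python
# Hedged sketch of selecting sub-categories whose average LM accuracy improves
# substantially after pre-training and finetuning (all values below are placeholders).
import pandas as pd

results = pd.DataFrame({
    "sub_category": ["morphological negation", "prepositional phrases", "quantifiers"],
    "acc_untrained": [0.34, 0.31, 0.33],
    "acc_trained":   [0.55, 0.62, 0.58],
})
results["improvement"] = results["acc_trained"] - results["acc_untrained"]
selected = results.loc[results["improvement"] >= 0.15, "sub_category"].tolist()
print(selected)
```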
We gathered responses to the diagnostic from a wide array of TLMs, including BERT (Devlin et al, 2018), RoBERTa (Liu et al, 2019), T5 (Raffel et al, 2020), ALBERT (Lan et al, 2020), XLNet (Yang et al, 2019), ELECTRA (Clark et al, 2020), Longformer (Beltagy et al, 2020), SpanBERT (Joshi et al, 2020), DeBERTa (He et al, 2020), and ConvBERT (Jiang et al, 2020).
Each of these models differs from the others along one or more factors, including underlying architecture, pre-training objective and data, or the general category the model belongs to.
The smaller versions of each TLM contained fewer transformer blocks, and thus fewer trainable parameters, making them less expressive models of language.
We used LSTM-based LMs (Hochreiter and Schmidhuber, 1997) as a baseline, which, unlike TLMs, primarily rely on recurrent links, as opposed to attention.
We used the SNLI (Bowman et al, 2015), MNLI (Williams et al, 2018), and ANLI (Nie et al, 2020) datasets to finetune our models for the NLI task.
To increase the variance in our results as much as possible, we finetuned all models on various combinations of these datasets: (1) SNLI alone, (2) MNLI alone, (3) SNLI + MNLI, and (4) SNLI + MNLI + ANLI.
Recall that all TLMs are trained in two stages: pre-training and then finetuning.
As the performance of our models on the diagnostic will be affected by both, we systematically alter whether a model is pre-trained or finetuned to further increase variance, using the following combinations:
– Finetune only: The model is trained for NLI, but the total amount of language it has been exposed to is much smaller without pre-training.
– Pre-train and finetune: The model is fully trained before evaluation.
For BERT, we experimented with both Devlin et al (2018)’s pre-trained models, and a BERT model we trained from scratch.
Our BERT model had an identical architecture to bert-base and was pre-trained on Google’s One Billion Words corpus (Chelba et al, 2014), which is a dataset of documents from various sources created by Google for pre-training LMs.
In summary, this process allowed us to vary the underlying architecture, the size of each architecture, and the amount of data the model was trained on.
This allowed us to treat each trained model as effectively being a different “individual” (and we will refer to them as such), which might have a radically different cognitive profile from its counterparts.
For example, a roberta-base model that was pre-trained and finetuned on all three NLI datasets would likely be much more proficient on our diagnostic than a roberta-large model trained on no NLI data at all.
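Conceptually, each trained “individual” is one cell in a grid over architecture, model size, finetuning data, and training regime. The abridged sketch below illustrates how such a population is enumerated; the lists are not the exact grid of 240 models used in the study.

```python
# Hedged sketch of enumerating the population of LM "individuals" (abridged lists).
from itertools import product

architectures = ["bert", "roberta", "t5", "albert", "xlnet"]        # abridged
sizes = ["base", "large"]
finetune_data = [("snli",), ("mnli",), ("snli", "mnli"), ("snli", "mnli", "anli")]
regimes = ["finetune_only", "pretrain_and_finetune"]

population = [
    {"arch": a, "size": s, "data": d, "regime": r}
    for a, s, d, r in product(architectures, sizes, finetune_data, regimes)
]
print(len(population), "individuals in this abridged grid")
```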
We recruited human participants using Amazon Mechanical Turk (mTurk). While mTurk makes conducting large-scale human studies convenient, there are also well-documented problems with participants not completing tasks in good faith (Berinsky et al, 2014).
There are multiple techniques for filtering out bad-faith participants, such as the use of attention check questions, sometimes called “instructional manipulation checks” (Hauser and Schwarz, 2015), which are designed so that a good-faith participant would be unlikely to get them incorrect.
But this alone would not suffice for our purposes here, as we want a certain number of low-scoring participants on some sub-categories, so that the population variances on sub-category items would better reflect their actual variances.
We first obtained attention checks from the ChaosNLI dataset (Nie et al, 2020), which gathered over 450,000 human annotations on questions from SNLI and MNLI.
Since each question in ChaosNLI was annotated by 100 different workers, if the inter-annotator agreement for a given question is extremely high, we conclude that question is likely easy to solve for good-faith participants.
The human studies were split up into five phases, and workers who did sufficiently well in a given phase were given a qualification to continue to the next phase:
1. On-boarding: A qualifying HIT (human intelligence task) open to any worker located in the United States, who had completed at least 50 HITs with an approval rating of at least 90%.
Workers were informed before starting every study that we would evaluate the quality of their work, and that it might be rejected if we found evidence that they did not put forth an honest effort.
In each phase, questions were randomly ordered, except for attention checks which were spread evenly throughout the survey.
We used Qualtrics1 to create the surveys for each HIT and collect the responses.
Participants were first presented with instructions for the task and some examples, which were based on the instructions originally given to annotators of the MNLI dataset.
For each question, workers also had to provide a short justification statement on why they believed their answer was correct, which was used to help filter out bad faith participants.
To validate the responses to our surveys, we developed the following authentication procedure:
1. Look for duplicate IPs or worker IDs, indicating that the worker took the HIT more than once. If there are any, keep only the first submission.
2. If the worker’s overall score was less than 40%, reject the HIT.
If their overall score was greater than 60%, accept the HIT.
For workers who scored between 40% and 60%, reject the HIT if they got less than 75% of the attention checks correct.
3. Finally, examine the justifications of all workers not previously rejected.
Here we were looking for simple, but clear, reasons for why workers chose their answer.
We included this step because we found in a pilot study that workers sometimes provided nonsensical justifications for their answers even when they did well on the survey, making it unclear whether they were truly paying attention.
Specifically, we checked that the justifications appeared relevant to the question, that they did not paste part of the question for their justification, that they did not use the same justification for every question, and that they did not use short nonsensical phrases for their justification (e.g., some simply wrote “good” or “nice” as their justification).
This allowed us to keep some low-scoring participants who had put genuine effort into the task.
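The acceptance rules above can be summarized in a single decision function. The argument names are hypothetical, and the justification check stands in for the manual review from step 3 rather than anything automated.

```python
# Hedged sketch of the submission-validation rules (field names are hypothetical).
def accept_submission(overall_score, attention_check_score, justifications_ok):
    """Scores are fractions in [0, 1]; duplicate IPs/worker IDs are assumed already filtered (step 1)."""
    if overall_score < 0.40:                                    # step 2: clear reject
        return False
    if overall_score <= 0.60 and attention_check_score < 0.75:  # borderline worker, failed attention checks
        return False
    return justifications_ok                                    # step 3: manual justification review

print(accept_submission(0.52, 0.80, justifications_ok=True))    # -> True
```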
Manual inspection of the resulting responses suggested that workers whose responses were accepted consistently gave higher quality responses than those who did not.
On the other hand, workers who failed to give good justifications generally scored at or below random chance, which further indicated that they were not actually paying attention.
We, therefore, believe the use of justifications helped us gather higher-quality responses.
Using this procedure, and those described in Section 3, we gathered results from 27 human participants and 240 neural LMs (183 transformer-based and 57 LSTMbased).
In addition to the LSTMs, we also include a true random baseline which simply guesses randomly on every question.
In the following experiments, we use the human performance on each category as the basis for analyzing the performance of the artificial populations, specifically using methods from classical test theory (both simple problem difficulty and inter-item correlation) and Rasch models (Rasch, 1993) from item response theory.
Our goal is to determine how well item properties measured using artificial models correlate with those measured using the human responses, using both Pearson and Spearman correlation coefficients.
We shall refer to the transformer population as T, the LSTM population as L, the random population as R, and the human population as H. We used the ltm R package to fit all Rasch models (Rizopoulos, 2006).

5.1 Classical Test Theory

We began by examining how well TLMs could predict simple problem difficulty in the human data. For each item i in a given sub-category, we calculated the percentage of human participants who got that item correct (D^i_H), and then the corresponding percentage for the TLMs (D^i_T), LSTM-based LMs (D^i_L), and the random baseline (D^i_R). We then calculated the Spearman correlation between D^i_H and each of the other populations. Given D_H, Spearman correlation and p-values were calculated with transformer-based (D_T), LSTM-based (D_L), and random (D_R) estimates of problem difficulty (Table 1).

In almost all cases, TLMs achieve a much stronger correlation with the human data than either baseline. The main exceptions are morphological negation and richer logical structure, both of which fail to produce strong, statistically significant correlations. As we will see, this pattern will repeat in other measurements as well.
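A minimal version of this comparison is sketched below: per-item proportion correct is computed for each population and then compared to the human estimates with a Spearman correlation. The response matrices are random placeholders standing in for the real data.

```python
# Hedged sketch of the classical-test-theory comparison (placeholder response data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
responses_human = rng.integers(0, 2, size=(27, 15))   # (participants, items), 1 = correct
responses_tlm = rng.integers(0, 2, size=(183, 15))

d_human = responses_human.mean(axis=0)                # D^i_H: fraction of humans correct on item i
d_tlm = responses_tlm.mean(axis=0)                    # D^i_T: fraction of TLMs correct on item i
rho, p_value = spearmanr(d_human, d_tlm)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```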
IIC-Based Clustering

An important idea in psychometrics is that items that rely on the same skills should have similar chances of being answered correctly by a given participant (Rust and Golombok, 2014). Whether items rely on similar skills can be tested using the inter-item correlation (IIC) between two items, where high IIC suggests that the items rely on similar underlying reasoning skills. After clustering, for each pair of items (i, j) we define C^D_{i,j} as 1 if i and j are in the same cluster as determined by dataset D ∈ {H, T, L, R}, and 0 otherwise. Finally, to determine how well clusters from the LM responses match the human responses, we calculated the Pearson correlation between C_H and each of C_T, C_L, and C_R. Similar to Table 1, we see statistically significant correlations from TLMs in every sub-category, except again for morphological negation (Table 2).

Table 2. Pearson correlation and p-values for how well items clustered using human responses match the clusters which used transformer-based (C_T), LSTM-based (C_L), and random (C_R) items.
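The surviving text does not spell out the clustering procedure itself, so the sketch below makes assumptions for illustration: items are clustered by average-linkage agglomerative clustering on a (1 − IIC) distance, co-membership matrices C^D are built per population, and their off-diagonal entries are compared with a Pearson correlation.

```python
# Hedged sketch of comparing item clusterings across populations (assumed clustering choices).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import pearsonr

def co_membership(responses, n_clusters=3):
    """responses: 0/1 array (num_respondents, num_items); returns C^D_{i,j}."""
    iic = np.corrcoef(responses.T)                          # inter-item correlation matrix
    dist = squareform(1.0 - iic, checks=False)              # condensed (1 - IIC) distances
    labels = fcluster(linkage(dist, method="average"), n_clusters, criterion="maxclust")
    return (labels[:, None] == labels[None, :]).astype(float)

rng = np.random.default_rng(0)
resp_h = rng.integers(0, 2, size=(27, 15))                  # placeholder human responses
resp_t = rng.integers(0, 2, size=(183, 15))                 # placeholder TLM responses

iu = np.triu_indices(15, k=1)                               # compare each item pair once
r, p = pearsonr(co_membership(resp_h)[iu], co_membership(resp_t)[iu])
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```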
Since TLMs correlated well with humans using the classical techniques we tested, we wished to examine whether this would still hold using methods from item response theory (IRT).
To do this, we used the diagnostic results from each population to fit Rasch models (Rasch, 1993).
This gave us separate difficulty parameter estimates bi for each item i, for each population.
To determine how well the difficulty parameters matched between populations, we calculated the Pearson correlation between the bi using our human response data (H), and the bi obtained using the other populations (T , L, R).
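The paper fits Rasch models with the ltm R package; the Python sketch below is only a rough stand-in that estimates item difficulties b_i by joint maximum likelihood so the comparison step can be illustrated end to end. Response matrices are random placeholders, and the estimator is not the one used in the study.

```python
# Hedged sketch: approximate Rasch item difficulties by joint maximum likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr

def rasch_difficulties(X):
    """X: 0/1 response matrix (num_respondents, num_items); returns centred b_i estimates."""
    n_persons, n_items = X.shape

    def neg_log_lik(params):
        theta, b = params[:n_persons], params[n_persons:]
        logits = theta[:, None] - b[None, :]          # Rasch model: P(X = 1) = sigmoid(theta - b)
        log_p = -np.logaddexp(0.0, -logits)           # log sigmoid(logits)
        log_q = -np.logaddexp(0.0, logits)            # log(1 - sigmoid(logits))
        return -(X * log_p + (1 - X) * log_q).sum()

    res = minimize(neg_log_lik, np.zeros(n_persons + n_items), method="L-BFGS-B")
    b = res.x[n_persons:]
    return b - b.mean()                               # centre difficulties for identifiability

rng = np.random.default_rng(0)
b_human = rasch_difficulties(rng.integers(0, 2, size=(27, 15)))   # placeholder data
b_tlm = rasch_difficulties(rng.integers(0, 2, size=(183, 15)))
r, p = pearsonr(b_human, b_tlm)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```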
As before, TLMs consistently get a stronger correlation than either baseline on most sub-categories, except for morphological negation and richer logical structure.
The only other experiment where LSTM-based LMs achieved stronger correlation was reported in Table 2, where they achieved superior correlation on morphological negation.
Table 3. Pearson correlation and p-values for transformer-based (D_T), LSTM-based (D_L), and random (D_R) estimates of problem difficulty computed using Rasch models. Columns marked with * are significant at p < 0.05, ** at p < 0.01, and *** at p < 0.001. (Column headings include the sub-categories quantifiers, propositional structure, richer logical structure, and world knowledge.)
6 Related Work

What reason do we have to suspect that TLMs can predict the psychometric properties of test items?
Although TLMs were not primarily designed to compute in a human-like way, there are some reasons to suspect that they may have the ability to effectively model at least some aspects of human linguistic reasoning: They consistently demonstrate superior performance (at least compared to other LMs) on human-inspired linguistic benchmarks (Wang et al, 2018, 2019), and they are typically pre-trained using a lengthy process designed to embed deep semantic knowledge, resulting in efficient encoding of semantic relationships (Zhou et al, 2020; Cui et al, 2020).
Common optimization tasks for pre-training transformers, such as the masked LM task (Devlin et al, 2018) are quite similar to the word prediction tasks that are known to predict children’s performance on other linguistic skills (Gambi et al, 2020).
Finally, TLMs tend to outperform other LMs in recent work modeling human reading times, eye-tracking data, and other psychological and psycholinguistic phenomena (Schrimpf et al, 2020b,a; Hao et al, 2020; Merkx and Frank, 2021; Laverghetta Jr. et al, 2021).
Despite the potential benefits psychometrics could bring to AI, work explicitly bridging these fields has been limited.
Ahmad et al (2020) created a deep learning architecture for extracting psychometric dimensions related to healthcare, specifically numeracy, literacy, trust, anxiety, and drug experiences.
Their architecture did not use transformers and relied instead on a sophisticated combination of convolutional and recurrent layers in order to extract representations of emotions, demographics, and syntactic patterns, among others.
Eisape et al (2020) examined the correlation between human and LM next-word predictions and proposed a procedure for achieving more human-like cloze probabilities.
In NLP, methods from IRT have been particularly popular.
Lalor et al (2018) used IRT models to study the impact of item difficulty on the performance of deep models on several NLP tasks.
In a follow-up study, Lalor and Yu (2020) used IRT models to estimate the competence of LSTM (Hochreiter and Schmidhuber, 1997) and BERT models during training.
Sedoc and Ungar (2020) used IRT to efficiently assess chat-bots.
Martínez-Plumed et al (2019) used IRT to analyze the performance of machine learning classifiers in a supervised learning task.
IRT has also been used to evaluate machine translation systems (Otani et al, 2016) and speech synthesizers (Oliveira et al, 2020).
Recent work has also used IRT models to evaluate progress on benchmark NLP tasks (Vania et al, 2021; Rodriguez et al, 2021).
We contribute to this literature by providing what is, to our knowledge, the first comprehensive assessment of the relationships between human and LM psychometric properties on a broad test of linguistic reasoning.
However, this improvement is also not uniform across all categories.
In fact, we have found some regularities in this regard.
In particular, TLMs failed to achieve a strong correlation on morphological negation in all cases.
This might be explained by two facts: there is little relative variance in the human responses in this sub-category, and the average accuracy of human participants was above 90%, as opposed to LM accuracy of 55%.
If this were successful, it would greatly reduce the burden of multiple rounds of empirical testing.
Of course, this study also has some important limitations.
The number of human participants in our study was somewhat small compared to typical psychometrics studies (which often contain hundreds or thousands of participants), making it difficult to draw stronger conclusions.
As stated earlier, practical limitations on population size is a common problem in psychometrics research, one which our present work hopes to alleviate somewhat.
Furthermore, although we reported in detail on certain psychometrics measures where our method demonstrated promising results for TLMs, it is worth reporting that certain other measures we examined did not appear to align well.
While this study has given us some insights into which fundamental reasoning skills TLMs can model well, it does not tell us anything about the order in which these skills are acquired, and especially whether this order is at all human-like.
For example, in our experiments, we found that TLMs consistently achieved a strong correlation on items requiring mastery of logical operators and lexical entailment (e.g., p = “The dog is on the mat and the cat is in the hat” and h = “The dog is on the mat”).
However, if we found that TLMs develop the ability to solve problems with conjunct-containing sentences before those with simpler sentences (e.g., p = “The dog is on the mat” and h = “The dog is not on the mat”), this would clearly not reflect the order of skill acquisition we would expect to see in humans.
Other methods from psychometrics, especially cognitive diagnostic models (Rupp and Templin, 2008) might give us a more nuanced understanding of how effective TLMs are as a model of human learning and development.
Finally, while our experiments have given us some insights into the validity and reliability of the diagnostic items, it is unclear whether our approach can allow us to measure their fairness.
It is not known whether the test items we examine here are consistent across different groups of differing socio-economic statuses, and we did not control for this in our recruitment.
Being able to probe this property of items would have interesting downstream applications.
For instance, it might indicate whether a diagnostic gives an unfair advantage to certain types of classifiers, and thus might discriminate against certain groups.
We believe our work offers a clear path forward for bridging psychometrics and AI.
The use of psychometric measures gives us a more nuanced understanding of the latent abilities of LMs than single-valued measures like accuracy or F1 can provide.
Furthermore, the increasingly powerful ability of TLMs to model human “commonsense” reasoning and knowledge suggests new ways to predict psychometric properties of test items, reducing the need for costly human empirical data.
Acknowledgments

This material is based upon work supported by the Air Force Office of Scientific Research under award numbers FA9550-17-1-0191 and FA9550-18-1-0052.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force.
Bibliography

Ahmad F, Abbasi A, Li J, Dobolyi DG, Netemeyer RG, Clifford GD, Chen H (2020) A deep learning architecture for psychometric natural language processing. ACM Transactions on Information Systems (TOIS) 38(1):1–29
Beltagy I, Peters ME, Cohan A (2020) Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150
Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. The Journal of Machine Learning Research 3:1137–1155
Berinsky AJ, Margolis MF, Sances MW (2014) Separating the Shirkers from the Workers? Making Sure Respondents Pay Attention on Self-Administered Surveys. American Journal of Political Science 58(3):739–753
Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics
Bringsjord S (2011) Psychometric artificial intelligence. Journal of Experimental & Theoretical Artificial Intelligence 23(3):271–277
Bringsjord S, Licato J (2012) Psychometric artificial general intelligence: the Piaget-MacGuyver room. In: Theoretical foundations of artificial general intelligence, Springer, pp 25–48
Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P, Robinson T (2014) One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In: Fifteenth Annual Conference of the International Speech Communication Association
Clark K, Luong MT, Le QV, Manning CD (2020) ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In: ICLR 2020: Eighth International Conference on Learning Representations
Cui L, Cheng S, Wu Y, Zhang Y (2020) Does BERT solve commonsense task via commonsense knowledge?
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186
Eisape T, Zaslavsky N, Levy R (2020) Cloze Distillation Improves Psychometric Predictive Power. In: Proceedings of the 24th Conference on Computational Natural Language Learning, pp 609–619
Gambi C, Jindal P, Sharpe S, Pickering MJ, Rabagliati H (2020) The relation between preschoolers’ vocabulary development and their ability to predict and recognize words. Child Development n/a(n/a), DOI 10.1111/cdev.13465, URL https://srcd.
Goldberg Y (2019) Assessing BERT’s syntactic abilities. CoRR abs/1901.05287, URL http://arxiv.org/abs/1901.05287, 1901.05287
Hao Y, Mendelsohn S, Sterneck R, Martinez R, Frank R (2020) Probabilistic Predictions of People Perusing: Evaluating Metrics of Language Model Performance for Psycholinguistic Modeling. In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pp 75–86
Hauser DJ, Schwarz N (2015) It’s a Trap! Instructional Manipulation Checks Prompt Systematic Thinking on “Tricky” Tasks. SAGE Open 5(2)
He P, Liu X, Gao J, Chen W (2020) DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654
Hernández-Orallo J, Martínez-Plumed F, Schmid U, Siebers M, Dowe DL (2016) Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence 230:74–107, DOI https://doi.org/10.
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural computation 9(8):1735–1780
Hu J, Gauthier J, Qian P, Wilcox E, Levy R (2020) A systematic assessment of syntactic generalization in neural language models. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, pp 1725–1744, DOI 10.18653/v1/2020.acl-main.158, URL https://www.aclweb.org/anthology/2020.acl-main.158
Jiang ZH, Yu W, Zhou D, Chen Y, Feng J, Yan S (2020) ConvBERT: Improving BERT with Span-based Dynamic Convolution. Advances in Neural Information Processing Systems 33
Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8:64–77
Jurafsky D (2000) Speech & language processing. Pearson Education India
Lalor JP, Yu H (2020) Dynamic Data Selection for Curriculum Learning via Ability Estimation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, NIH Public Access, vol 2020, p 545
Lalor JP, Wu H, Munkhdalai T, Yu H (2018) Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, NIH Public Access, vol 2018, p 4711
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2020) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In: ICLR 2020: Eighth International Conference on Learning Representations
Laverghetta Jr A, Nighojkar A, Mirzakhalov J, Licato J (2021) Can transformer language models predict psychometric properties? In: Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, Online, pp 12–25
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692
Martínez-Plumed F, Prudêncio RB, Martínez-Usó A, Hernández-Orallo J (2019) Item response theory in AI: Analysing machine learning classifiers at the instance level. Artificial Intelligence 271:18–42
Merkx D, Frank SL (2021) Human Sentence Processing: Recurrence or Attention? In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics, Online, pp 12–22
Nie Y, Williams A, Dinan E, Bansal M, Weston J, Kiela D (2020) Adversarial NLI: A New Benchmark for Natural Language Understanding.
Nie Y, Williams A, Dinan E, Bansal M, Weston J, Kiela D (2020) Adversarial NLI: A New Benchmark for Natural Language Understanding (英語) 訳抜け防止モード: Nie Y, Williams A, Dinan E, Bansal M, Weston J Kiela D (2020 ) Adversarial NLI : 自然言語理解のための新しいベンチマーク
0.76
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics
第58回計算言語学会年次大会紀要 訳抜け防止モード: 第58回計算言語学会年会紀要 計算言語学連合
0.36
Nie Y, Zhou X, Bansal M (2020) What Can We Learn from Collective Human Opinions on Natural Language Inference Data.
Nie Y, Zhou X, Bansal M (2020) 自然言語推論データに関する集合的人間の意見から学ぶことができるもの
0.78
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 9131–9143
In:Proceings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 9131-9143
0.42
Oliveira CS, Ten´orio CC, Prudˆencio R (2020) Item Response Theory to Estimate the
Oliveira CS, Ten ́orio CC, Prudiencio R (2020) アイテム応答理論
0.70
Latent Ability of Speech Synthesizers.
音声合成装置の潜在能力
0.60
In: ECAI Otani N, Nakazawa T, Kawahara D, Kurohashi S (2016) Irt-based aggregation model of crowdsourced pairwise comparison for evaluating machine translations.
In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 511–520
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Journal of Machine Learning Research 21(140):1–67
Rasch G (1993) Probabilistic models for some intelligence and attainment tests.
ERIC
Rizopoulos D (2006) ltm: An R package for latent variable modeling and item response
theory analyses. Journal of Statistical Software 17(5):1–25
Rodriguez P, Barrow J, Hoyle AM, Lalor JP, Jia R, Boyd-Graber J (2021) Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?
In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, pp 4486–4503
Rupp AA, Templin JL (2008) Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art.
Measurement 6(4):219–262
Rust J, Golombok S (2014) Modern psychometrics: The science of psychological assessment.
Routledge
Schrimpf M, Blank I, Tuckute G, Kauf C, Hosseini EA, Kanwisher N, Tenenbaum J, Fedorenko E (2020a) Artificial neural networks accurately predict language processing in the brain.
Schrimpf M, Blank I, Tuckute G, Kauf C, Hosseini EA, Kanwisher N, Tenenbaum J, Fedorenko E (2020b) The neural architecture of language: Integrative reverse-engineering converges on a model for predictive processing.
bioRxiv 2020.06.26.174482, URL https://www.biorxiv.org/content/10.1101/2020.06.26.174482
Sedoc J, Ungar L (2020) Item Response Theory for Efficient Human Evaluation of Chatbots.
In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp 21–33
Vania C, Htut PM, Huang W, Mungra D, Pang RY, Phang J, Liu H, Cho K, Bowman SR (2021) Comparing Test Sets with Item Response Theory.
In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, pp 1141–1158
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is All You Need.
In: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA, NIPS'17, pp 6000–6010
Wang A, Singh A, Michael J, Hill F, Levy O, Bowman S (2018) GLUE: A multi-task benchmark and analysis platform for natural language understanding.
In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, pp 353–355, DOI 10.18653/v1/W18-5446, URL https://www.aclweb.org/anthology/W18-5446
Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, Levy O, Bowman SR (2019) SuperGLUE: A stickier benchmark for general-purpose language understanding systems.
In: Proceedings of NeurIPS
Wilcox EG, Gauthier J, Hu J, Qian P, Levy R (2020) On the predictive power of neural
language models for human real-time comprehension behavior.
arXiv preprint arXiv:2006.01912
Williams A, Nangia N, Bowman S (2018) A broad-coverage challenge corpus for sentence understanding through inference.
In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, pp 1112–1122
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Scao TL, Gugger S, Drame M, Lhoest Q, Rush AM (2020) Transformers: State-of-the-art natural language processing.
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Scao TL, Gugger S, Drame M, Lhoest Q, Rush AM (2020) Transformers: State-of-the-art natural language processing。 訳抜け防止モード: wolf t, debut l, sanh v, chaumond j, delangue c, moi a, cistac p, rault t, louf r, funtowicz m, davison j, shleifer s, von platen p, ma c, jernite y, plu j, xu c, scao tl, gugger s, drame m, lhoest q, rush am (2020) トランスフォーマー : state - of -the - art natural language processing。
0.70
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, pp 38–45, URL https://www.aclweb.org/anthology/2020.emnlp-demos.6
Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding.
In: Advances in Neural Information Processing Systems, vol 32, pp 5753–5763
Yarotsky D (2021) Universal approximations of invariant maps by neural networks.
Constructive Approximation pp 1–68
Zhou X, Zhang Y, Cui L, Huang D (2020) Evaluating commonsense in pre-trained language models.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 9733–9740