Forming a reliable judgement of a machine learning (ML) model's
appropriateness for an application ecosystem is critical for its responsible
use, and requires considering a broad range of factors including harms,
benefits, and responsibilities. In practice, however, evaluations of ML models
frequently focus on only a narrow range of decontextualized predictive
behaviours. We examine the evaluation gaps between the idealized breadth of
evaluation concerns and the observed narrow focus of actual evaluations.
Through an empirical study of papers from recent high-profile conferences in
the Computer Vision and Natural Language Processing communities, we demonstrate
a general focus on a handful of evaluation methods. By considering the metrics
and test data distributions used in these methods, we draw attention to which
properties of models are centered in the field, revealing the properties that
are frequently neglected or sidelined during evaluation. By studying these
properties, we demonstrate the machine learning discipline's implicit
assumption of a range of commitments which have normative impacts; these
include commitments to consequentialism, abstractability from context, the
quantifiability of impacts, the limited role of model inputs in evaluation, and
the equivalence of different failure modes. Shedding light on these assumptions
enables us to question their appropriateness for ML system contexts, pointing
the way towards more contextualized evaluation methodologies for robustly
examining the trustworthiness of ML models.
BEN HUTCHINSON, Google Research, Australia; NEGAR ROSTAMZADEH, Google Research, Canada; CHRISTINA GREER, Google Research, USA; KATHERINE HELLER, Google Research, USA; VINODKUMAR PRABHAKARAN, Google Research, USA
arXiv:2205.05256v1 [cs.LG] 11 May 2022
To address such questions, model evaluations use a variety of methods, and in doing so make technical and normative assumptions that are not always explicit.
These implicit assumptions can obscure the presence of epistemic gaps and motivations in the model evaluations, which, if not identified, constitute risky unknown unknowns.
Although leaderboards support the need of the discipline to iteratively optimize for
For example, over-reliance on a small number of evaluation metrics can lead to gaming the metric (cf. Goodhart’s Law “when a measure becomes a target, it ceases to be a good measure”) [161]; this can happen unintentionally as researchers pursue models with “state-of-the-art” performance.
Benchmarks that encourage narrowly optimizing for test set accuracy can also lead to models relying on spurious signals [31], while neglecting the challenge of measuring the full range of likely harms [22].
Birhane et al. find evidence for this in their study of the discourse of ML papers, showing that the field centers accuracy, generalization, and novelty, while marginalizing values such as safety [18].
Given that benchmark evaluations serve as proxies for performance on underlying abstract tasks [151], evaluating against a range of diverse benchmarks for each task might help mitigate biases within each benchmark.
However, ML research disciplines seem to be trending towards relying on fewer evaluation benchmark datasets [93], with test set reuse potentially leading to a research community’s overfitting with respect to the general task [103, 177].
Furthermore, within each benchmark, items are weighted equally (thus focusing on the head of the data distribution), failing to capture inherent differences in difficulty across items, and hence providing poor measures of progress on task performance [141].
As Raji et al. point out, the ML research discipline's decontextualized and non-systematic use of benchmark data raises serious issues with regards to the validity of benchmarks as measures of progress on general task performance [135].
This paper complements and extends this range of critiques, considering the risks of application developers adopting the ML research community’s standard evaluation methodologies.
We seek to address challenges in measuring technology readiness [104, 140], while acknowledging that this cannot be reduced to a purely technical question [43, 140].
By studying and analyzing the ML research community’s evaluation practices, we draw attention to the evaluation gaps between ideal theories of evaluation and what is observed in ML research.
By considering aspects of evaluation data and evaluation metrics—as well as considerations of evaluation practices such as error analysis and reporting of error bars—we highlight the discrepancies between the model quality signals reported by the research community and what is relevant to real-world model use.
Our framework for analyzing the gaps builds upon and complements other streams of work on ML evaluation practices, including addressing distribution shifts between development data and application data [34, 94, 160], and robustness to perturbations in test items [118, 132, 173].
We situate this work alongside studies of the appropriateness of ML evaluation metrics (e.g., [47, 88, 177]), noting that reliable choice of metric is often hampered by unclear goals [44, 97].
In foregrounding the information needs of application developers, we are also aligned with calls for transparent reporting of ML model evaluations [117], prioritizing needs of ML fairness practitioners [77], model auditing practices [136], and robust practices for evaluating ML systems for production readiness [23].
In Section 2, we consider various ideal goals that motivate why ML models are evaluated, discussing how these goals can differ between research contexts and application contexts.
By comparing the ideal goals of evaluation with the observed evaluation trends in our study, we highlight in Section 4 the evaluation gaps that present challenges to evaluations being good proxies for what application developers really care about.
We identify six implicit evaluation assumptions that could account for the presence of these gaps.
Finally, in Section 5, we discuss various techniques and methodologies that may help to mitigate these gaps.
2 IDEALS OF ML MODEL EVALUATION
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
Although this paper is ultimately concerned with practical information needs when evaluating ML models for use in applications, it is useful to first step back and consider the ultimate motivations and goals of model evaluation.
In this paper, we will speak of a model evaluation as a system of arbitrary structure that takes a model as an input and produces outputs of some form to judge the model.
The evaluation might be motivated by various stakeholder perspectives and interests [91].
The output might, for example, produce a single metric and an associated numeric value, or a table of such metrics and values; it might include confidence intervals and significance tests on metric values; and it might include text.
a) aim to shed light on the training data (for example ML model evaluations can shed light on the data-generation practices used by institutions [5]), or
b) “Green AI” explorations of how the learner can efficiently use limited amounts of resources [152].
However, when we evaluate a model without a specific application in mind, we lose the opportunity to form judgements specific to a use case.
On the other hand, application-centric evaluations are concerned with how the model will operate within an ecosystem consisting of both human agents and technical components (Figure 1), sometimes described as the “ecological validity” [46].
Applications often use scores output by the model to initiate discrete actions or decisions, by applying a specific classification threshold to the scores.1
Table 1. Summary of typical goals of the idealized learner-centric and application-centric evaluations.
This distinction between learner-centric and application-centric is related (albeit imperfectly) to the different objectives of model evaluations that concern the engineering and science disciplines [113, 168].
Note that we are not claiming (cf. the debate in [122]) that science lies outside the bounds of statistical/ML methods, but rather that scientific-flavored pursuits have distinct uses of such methods [24].
Debates between AI practitioners about the relationships between AI, science, and statistical methods have a long history, for example Diana Forsythe’s studies of 1980s AI labs [56].
Important to this debate regarding the scientific goals of ML is the question of construct validity; that is, whether our measurements actually measure the things that we claim they do [85, 86, 135].
Conversely, consequential validity—which includes the real-world consequences of an evaluation’s interpretation and use—is likely more important to considerations of accountability and governance of ML models in applications [86].
This distinction is closely related to one between “scientific testing” and “competitive testing” made by Hooker in 1995, who takes the position that competitive testing
b) does not constitute true research but merely development [78].
However, since engineering research has its own goals, distinct from those of science [26], a more defensible position is that evaluations in support of scientific research are distinct from evaluations in support of engineering research.
Table 1 summarizes the above distinctions and the relationships between them.
The distinction between learner-centric and application-centric evaluations relates to the question of internal validity and external validity, which is more commonly discussed in the social sciences than in ML (see, e.g., [123]) but also arises sometimes in ML [103].
This is reflected in the ways in which practitioners of the two types of evaluations discuss the topic of robustness.
Learner-centric evaluations pay attention to the robustness of the learner to changes in the training data (e.g., distributional shifts, outliers, perturbations, poisoning attacks; and with connections to robust estimation of statistics [101]), while application-centric evaluations pay attention to desired behaviors such as the (in)sensitivity of the model to certain classes of perturbations of the input, or to sensitive input features (e.g., [61]).
Note that nothing in the ideals of evaluation described above has stipulated whether evaluations are quantitative or qualitative.
For example, one could imagine interrogating a chatbot model using qualitative techniques, or adopting methodologies of political critique such as [41].
Similarly, nothing has stipulated what combinations of empirical or deductive methods are used.
3 ML MODEL EVALUATIONS IN PRACTICE
Beneath the technical issues lie some differences in values concerning not only the meaning but also the relative merit of “science” and “artificial intelligence.” — Diana Forsythe [56]
To shed light on the ML research community’s norms and values around model evaluation, we looked at how these communities report their model evaluations.
These include: a survey of 144 research papers studying the properties of models that are tested for [177]; a review of 107 papers from Computer Vision (CV), Natural Language Processing (NLP), and other ML disciplines to diagnose internal and external modes of evaluation failures [103]; an analysis of whether 60 NLP and CV papers pay attention to accuracy or efficiency [152]; and an analysis of the Papers With Code dataset2 for patterns of benchmark dataset creation and re-use [93].
3.1 Method
3.1.1 Data.
We sampled 200 research papers, stratified by discipline, conference and year.
100 papers were selected from each of the NLP and CV disciplines.
We selected 20 papers from the proceedings of each of the 55th to 59th Annual Meetings of the Association for Computational Linguistics (ACL’2017–ACL’2021), 25 papers at random from each of the proceedings of the 2019–2021 IEEE Conferences on Computer Vision and Pattern Recognition (CVPR’2019–CVPR’2021), and 25 papers from the 24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI’2021).
These conferences represent the pinnacles of their respective research fields.3
3.1.2 Analysis. The authors of this paper performed this analysis, dividing the papers among themselves based on disciplinary familiarity.
Using an iterative procedure of analysis and discussion, we converged on a set of labels that captured important aspects of evaluations across and within disciplines.
Recall from Section 2 that, for our purposes, a single evaluation typically involves choosing one or more metrics and one or more datasets.
c) Analysis: Was statistical significance of differences reported?
Were error bars and/or confidence intervals reported?
Was error analysis performed?
Were examples of model performance provided to complement measurements with qualitative information?
3.2 Results
Although each of the disciplines and conferences does not define itself solely in terms of ML, the practice of reporting one or more model evaluations in a research paper is ubiquitous.
Only five papers did not include evaluations of ML models; of these, three were published at ACL (a survey paper, a paper aimed at understanding linguistic features, and one on spanning-tree algorithms), and two at CVPR (a paper with only qualitative results, and one introducing a dataset).
Counts are non-exclusive, for example papers frequently reported multiple metrics and sometimes reported performance both on I.I.D. test data and on non-I.I.D. test data.
Appendix B contains an overview of the flavors of test data we observed.
We found evidence to support the claim that evaluations of NLP models have “historically involved reporting the performance (generally meaning the accuracy) of the model on a specific held-out [i.e., I.I.D.] test set” [20, p. 94].4
CV evaluations seem to be even more likely to utilize I.I.D. test data, and—consistent with [93]—CV papers typically either introduce a new task (and corresponding dataset) or reuse existing benchmark datasets.
2 https://paperswithcode.com
3 ACL and CVPR are rated A∗ (“flagship conference”), and MICCAI is rated A (“excellent conference”) by core.edu.au; all three are in the top 30 computer science conferences out of over 900 listed on research.com.
Description and example metrics:
Sensitive to the sum TP+TN and to N. Not sensitive to class imbalance. Examples: Accuracy, error rate.
Sensitive to TP and FP. Not sensitive to FN or TN. Examples: Precision, BLEU.
Sensitive to TP and FN. Not sensitive to FP or TN. Examples: Recall, ROUGE.
Sensitive to TP, FP and FN. Not sensitive to TN. Examples: 𝐹1, 𝐹𝛽.
Sensitive to intersection and overlap of predicted and actual. Examples: Dice, IoU.
Sensitive to the probability that the model assigns to the test data. Examples: Perplexity.
Sensitive to the distance between the prediction and the actual value. Examples: MSE, MAE, RMSE, CD.
Sensitive to each of TP, TN, FP and FN, but unlike Accuracy metrics they factor in the degree of agreement that would be expected by chance. Examples: Pearson's 𝑟, Spearman's 𝜌.
AUC.
An exception to this trend was CV papers which explored shared representations (e.g., in multi-task learning [53, 99] or domain adaptation [119, 126]).
Evaluations in both disciplines showed a heavy reliance on reporting point estimates of metrics, with variance or error bars typically not reported in our sample.
While colloquial uses of phrases like “significantly better” were fairly common, most papers did not report on technical calculations of statistical differences; we considered only those latter instances when coding whether a paper reported significance.
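As a concrete illustration of what such reporting could add, the sketch below (not drawn from any surveyed paper) computes a nonparametric bootstrap confidence interval alongside a point estimate of accuracy; the synthetic labels, the roughly 85% accuracy level, and the 2,000-resample setting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a test set's reference labels and model predictions.
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)  # roughly 85% accurate

def accuracy(t, p):
    return float(np.mean(t == p))

# Nonparametric bootstrap: resample test items with replacement and recompute
# the metric, so the evaluation reports an interval rather than a bare point estimate.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    boot.append(accuracy(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"accuracy = {accuracy(y_true, y_pred):.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```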
For example, accuracy does not distinguish between FP and FN; 𝐹1 is symmetric in FP and FN (they can be swapped without affecting 𝐹1); the Overlap metrics are similarly invariant to swapping of the predicted bounding box and the reference bounding box; the Distance category of metrics does not distinguish over-estimation from under-estimation on regression tasks.
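These insensitivities are easy to verify directly. The sketch below uses two hypothetical sets of confusion-matrix counts with FP and FN swapped; accuracy and 𝐹1 come out identical even though one error profile misses positives and the other raises false alarms.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from raw confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

# Hypothetical error profiles with FP and FN counts swapped.
misses_positives = metrics(tp=80, fp=5, fn=15, tn=100)
raises_alarms    = metrics(tp=80, fp=15, fn=5, tn=100)

print("accuracy/precision/recall/F1 when FN-heavy:", [round(v, 3) for v in misses_positives])
print("accuracy/precision/recall/F1 when FP-heavy:", [round(v, 3) for v in raises_alarms])
# Accuracy and F1 are identical in both rows; only precision and recall trade places.
```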
From our reading of the 200 papers in our sample, one qualitative observation we had was that model evaluations typically do not include concrete examples of model behavior, nor analyses of errors (for a counterexample which includes these practices, see [35]).
Also, we noted the scarcity of papers whose sole contribution is a new dataset for an existing task, aligning with previous observations that dataset contributions are not valued highly within the community [147].
We hypothesise that conference reviewers place emphasis on novelty of model, task, and/or metric.
We note a general tension between disciplinary values of task novelty and demonstrating state-of-the-art performance by outperforming previous models, and the risk of overfitting from test set re-use discussed by [103].
Test data was often old (e.g., the CoNLL 2003 English NER dataset [149], used in two papers); optimizing for these static test sets fails to account for societal and linguistic change [14].
Disaggregation of metrics was rare, and fairness analyses were absent despite our sample being from 2017 onward, concurrent with mainstream awareness of ML fairness concerns.
resource-efficiency that are typical of engineering disciplines [26], suggesting that the ML research disciplines generally aspire to scientific goals concerning understanding and explaining the learner.
With this lens, the disciplinary paradigm of measuring accuracy on I.I.D. test data is not surprising: the goal is to assess a model’s ability to generalize.
This assessment would then give us good guarantees on the application’s behavior, if the practical challenges of ascertaining the data distributions in an application ecosystem can be overcome.
In practice, however, these challenges can be severe, and the research papers we surveyed do not generally tackle questions of uncertainty regarding data distributions.
4 GAPS AND ASSUMPTIONS IN COMMON EVALUATION PRACTICES
In theory there is no difference between theory and practice, while in practice there is.
— Brewster (1881) [25]
We now consider whether the research evaluation practices observed in Section 3 are aligned with the needs of decision-makers who consider whether to use a model in an application.
That is, we consider whether the typically learner-centric evaluations, which commonly use metrics such as accuracy or 𝐹1 on test data I.I.D. with the training data, meet the need of application-centric evaluations.
In doing so, we expose, in a novel way, the interplay of technical and normative considerations in model evaluation methodologies.
4.1 Assumptions in Model Evaluation
We introduce six assumptions in turn, describing both how they operate individually in evaluations and how they compose and compound.
Our starting point is the observation from Section 2 that the goal of application-centric model evaluations is to understand how a model will interact with its ecosystem, which we denote schematically as:
In adopting consequentialism as its de facto ethical framework, ML prioritizes the greatest good for the greatest number [84] and centers measurable future impacts.
This is realised as a focus on the first-order consequences of introducing the model into the ecosystem.
Changes to the ecosystem itself—e.g., addressing what social change is perceived as possible and desirable [49, 68, 79]—are assumed to be out of scope, as are concerns for setting precedents for other ML developers [156].
Schwartz et al. coin the phrase “Red AI” to describe ML work that disregards the costs of training, noting that such work inhibits discussions of when costs might outweigh benefits [152].
Another outcome of focusing primarily on direct consequences is marginalizing the assessment of a model against the social contracts that guide the ecosystem in which the model is used, such as moral values, principles, laws, and social expectations.
The model itself is reduced to a predicted value ˆ𝑌, ignoring, e.g., secondary model outputs such as confidence scores, or predictions on auxiliary model heads.
Also, reducing an ecosystem to model inputs and “ground truth” overlooks questions of system dynamics [111, 154], such as feedback loops, “humans-in-the-loop,” and other effects “due to actions of various agents changing the world” [15].
By positing a variable 𝑌 = 𝑦 which represents the “ground truth” of a situation—even in situations involving social phenomena—a positivist stance on knowledge is implicitly adopted.
That is, a “true” value 𝑌 = 𝑦 is taken to be objectively singular and knowable.
This contrasts with anthropology’s understanding of knowledge as socially and culturally dependent [57] and requiring interpretation [63].
In the specific cases of CV and NLP discussed in Section 3, cultural aspects of image and language interpretation are typically marginalized (cf. [11, 16, 89, 100], for example), exemplifying what Aroyo and Welty call AI’s myth of “One Truth” [7].
Furthermore, the positivist stance downplays the importance of questions of construct validity and reliability [58, 86].
Assumption 3: Input Myopia.
Once the input variable 𝑋 has been used by the model to calculate the model prediction ˆ𝑌, 𝑋 is typically ignored for the remainder of the evaluation.
Fig. 2. Causal graph illustrating the Input Myopia Assumption.
That is, the utility of the model is assumed to depend only on the model’s prediction and on the “ground truth.”
We illustrate this with a causal graph diagram in Figure 2, which shows Utility as independent of 𝑋 once the effects of ˆ𝑌 and 𝑌 are taken into account.
𝑈𝑡𝑖𝑙𝑖𝑡𝑦( ˆ𝑌, 𝑋, 𝑌) ≈ 𝑈𝑡𝑖𝑙𝑖𝑡𝑦( ˆ𝑌, 𝑌)    (Input Myopia Assumption)
Evaluation Gap 5: Disaggregated Analyses.
By reducing the variables of interest in the evaluation to the prediction ˆ𝑌 and the ground truth 𝑌, the downstream evaluation is denied the potential to use 𝑋.
This exacerbates Evaluation Gap 3 by further abstracting the evaluation statistics from their contexts.
For example, 𝑋 could have been used to disaggregate the evaluation statistics in various dimensions—including for fairness analyses, assuming that socio-demographic data is available and appropriate [6, 9]—or to examine regions of the input space which raise critical safety concerns (e.g., distinguishing a computer vision model’s failure to recognise a pedestrian on the sidewalk from failure to recognise one crossing the road) [3].
Similarly, robustness analyses which compare the model predictions for related inputs in the same neighborhood of the input space are also excluded.
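To make concrete what discarding 𝑋 forgoes, here is a minimal sketch under invented assumptions (a hypothetical subgroup attribute, error rates, and group mix): the aggregate accuracy looks healthy while the disaggregated view exposes a large gap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy evaluation set: the input X carries a subgroup attribute that an
# aggregate metric discards once predictions have been computed.
n = 1000
group = rng.choice(["majority", "minority"], size=n, p=[0.9, 0.1])
y_true = rng.integers(0, 2, size=n)
# Hypothetical model: strong on the majority group, much weaker on the minority group.
correct = np.where(group == "majority", rng.random(n) < 0.95, rng.random(n) < 0.70)
y_pred = np.where(correct, y_true, 1 - y_true)

overall = float(np.mean(y_true == y_pred))
by_group = {g: float(np.mean((y_true == y_pred)[group == g])) for g in ("majority", "minority")}
print(f"overall accuracy: {overall:.3f}")
print(f"disaggregated accuracy: {by_group}")  # the minority-group gap is invisible in the overall number
```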
Assumption 4: Quantifiability. We have not yet described any modeling assumptions about the mathematical or topological nature of the implied 𝑈𝑡𝑖𝑙𝑖𝑡𝑦 function, which up to now has been conceived as an arbitrary procedure producing an arbitrary output.
We observe, however, that when models are evaluated, there is a social desire to produce a small number of scalar scores.
We identify two assumptions here: first, that impacts on each individual can be reduced to a single numeric value (and thus different dimensions of impacts are commensurable5); second, that impacts across individuals are similarly commensurable.
We define ˆ𝑦 ∈ ˆ𝑌 and 𝑦 ∈ 𝑌 to be a specific model prediction and a specific "ground truth" value, respectively, leading to the Individual Quantifiability Assumption and the Collective Quantifiability Assumption.
The Quantifiability Assumptions assume that the impacts on individuals are reducible to numbers, trivializing the frequent difficulty in comparing different benefits and costs [110].
Furthermore, the harms and benefits across individuals are assumed to be comparable in the same scale.
These assumptions are likely to disproportionately impact underrepresented groups, for whom model impacts might differ in qualitative ways from the well represented groups [74, 145, 146].
The former groups are less likely to be represented in the ML team [172] and hence less likely to have their standpoints on harms and benefits acknowledged.
For classification tasks, common evaluation metrics such as accuracy or error rate model the utility of ˆ𝑌 as binary (i.e., either 1 or 0), depending entirely on whether or not it is equal to the “ground truth” 𝑌. That is, for a binary task, 𝑈𝑡𝑖𝑙𝑖𝑡𝑦( ˆ𝑌=0, 𝑌=0) = 𝑈𝑡𝑖𝑙𝑖𝑡𝑦( ˆ𝑌=1, 𝑌=1) = 1 and 𝑈𝑡𝑖𝑙𝑖𝑡𝑦( ˆ𝑌=0, 𝑌=1) = 𝑈𝑡𝑖𝑙𝑖𝑡𝑦( ˆ𝑌=1, 𝑌=0) = 0.
In multiclass classification, severely offensive predictions (e.g., predicting an animal in an image of a person) are given the same weight as inoffensive ones.
In regression tasks, insensitivity to either the direction of the difference ˆ𝑦 − 𝑦 or the magnitude of 𝑦 can result in evaluations being possibly poor proxies for downstream impacts.
(One common application use case of regression models is to apply a cutoff threshold 𝑡 to the predicted scalar values, for which both the direction of error and the magnitude of 𝑦 are relevant.)
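A small sketch of that point, with assumed values: two hypothetical regression models have identical MSE, yet once a cutoff threshold is applied their errors fall on opposite sides of the decision, producing different downstream failures.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])   # reference scalar values
over = y + 1.0                        # hypothetical model that always over-estimates
under = y - 1.0                       # hypothetical model that always under-estimates

mse = lambda p: float(np.mean((p - y) ** 2))
print("MSE:", mse(over), mse(under))  # identical: 1.0 and 1.0

# Assumed application rule: act whenever the predicted score exceeds the threshold t.
t = 2.5
should_act = y > t
for name, p in [("over-estimates", over), ("under-estimates", under)]:
    acts = p > t
    fp = int(np.sum(acts & ~should_act))   # unnecessary actions
    fn = int(np.sum(~acts & should_act))   # missed actions
    print(f"{name}: false positives={fp}, false negatives={fn}")
```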
Taken collectively, the previous assumptions might lead one to use accuracy as an evaluation metric for a classification task.
Further assumptions can then be made in deciding how to estimate accuracy.
The final assumption we discuss here is that the test data over which accuracy (or other metrics) is calculated provides a good estimate of the accuracy of the model when embedded in the ecosystem.
P( ˆ𝑌 = ˆ𝑦, 𝑌 = 𝑦) ≈ P( ˆ𝑌 ′ = ˆ𝑦 ′, 𝑌 ′ = 𝑦 ′)    (Assumption of Test Data Validity [Classification])
where 𝑌 ′ = 𝑦′ and ˆ𝑌 ′ = ˆ𝑦′ are the ground truth labels and the model predictions on the test data, respectively.
Evaluation Gap 8: Data Drifts.
A simple model of the ecosystem’s data distributions is particularly risky when system feedback effects would cause the distributions of data in the ecosystem to diverge from those in the evaluation sample [92, 106].
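As an illustration of how fragile the test-data assumption can be, the following sketch re-estimates accuracy under a hypothetical deployment mix that differs from the test sample's mix over an observable slice; the slice, the accuracy levels, and the deployment proportions are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy test set: accuracy differs between "easy" and "hard" items, and hard items
# are rarer in the I.I.D. test sample than they would be in the deployment ecosystem.
n = 5000
hard = rng.random(n) < 0.2                                   # 20% hard items in the test sample
correct = np.where(hard, rng.random(n) < 0.60, rng.random(n) < 0.95)

# Assumed (hypothetical) deployment mix: 60% hard items.
p_test = {True: 0.2, False: 0.8}
p_deploy = {True: 0.6, False: 0.4}
w = np.where(hard, p_deploy[True] / p_test[True], p_deploy[False] / p_test[False])

print(f"I.I.D. test accuracy:         {correct.mean():.3f}")
print(f"importance-weighted estimate: {float(np.sum(w * correct) / np.sum(w)):.3f}")
```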
We sketch this composition of assumptions in Figure 4, along with questions that illustrate the gaps raised by each assumption.
Our reason for teasing apart these assumptions and their compounding effects is not to attack the “strawman” of naive application-centric evaluations which rely solely on estimating model accuracy.
For example: • Some robustness evaluations (for surveys, see [54, 170]) explicitly tackle the problem of distribution shifts, rejecting the Assumptions of Test Data Validity without questioning the other assumptions we have identified.
• Some sensitivity evaluations consider the effect on the model predictions of small changes in the input, but use accuracy as an evaluation metric, rejecting the Input Myopia Assumption without questioning the others [139].
Table 4. Sketch of how the six assumptions of Section 4—when taken collectively—compose to simplify the task of evaluating a model. A pseudo-formal notation (akin to pseudo-code) is used to enable rapid glossing of the main connections. The rows pair the assumptions (Consequentialism, Abstractability from Context, Input Myopia, Quantifiability, and the others) with considerations that each sidelines, such as data sourcing and processing, invisible labour, consultation with impacted communities, motives, public acceptance, and human rights; different flavors of impacts on a single person and across groups; severe failure cases, confusion matrices, and the topology of the prediction space; and data sampling biases and distribution shifts.
𝑌 = 𝑦 and ˆ𝑌 = ˆ𝑦 denote the true (unobserved) distributions of ground truth and model predictions, respectively, while the variables 𝑌 ′ = 𝑦′ and ˆ𝑌 ′ = ˆ𝑦′ denote the samples of reference labels and model predictions over which accuracy is calculated in practice.
The order of the assumptions reflects an increasing focus on technical aspects of model evaluation, and a corresponding minimizing of non-technical aspects.
Appendix C illustrates how each of the sets of considerations might apply in a hypothetical application of a computer vision model.
It may not be possible to avoid all of the assumptions all of the time; nevertheless unavoidable assumptions should be acknowledged and critically examined.
The six assumptions we have identified also provide a lens for assessing the consistency of some evaluation metrics with other assumptions that have been made during the evaluation, for example:
• Is 𝐹-score consistent with a utilitarian evaluation framework? The 𝐹-score is mathematically a harmonic mean—which is often appropriate for averaging pairs of rates (e.g., two speeds).
When applied to Precision and Recall, however, the 𝐹-score constitutes a peculiar averaging of “apples and oranges,” since, when conceived as rates, Precision and Recall measure rates of change of different quantities [130].
• Since AUROC is calculated by averaging over a range of possible threshold values, it “cannot be interpreted as having any relevance to any particular classifier” [129] (which is not to say AUROC is irrelevant to evaluating the learner, cf. Section 2, nor to a learned model’s propensity to correctly rank positive instances above negative ones).
In both cases, we ask whether such metrics are of limited utility in application-centric evaluations and whether they are better left to learner-centric ones.
5 CONTEXTUALIZING APPLICATION-CENTRIC MODEL EVALUATIONS
the ornithologists were forced to adapt their behavior (for the sake of “science”) to the most primitive evaluation method which was the only one considered or known, or else throw their data away.
As discussed in Section 3, the different goals and values of academic ML research communities mean that research norms cannot be relied upon as guideposts for evaluating models for applications.
In this section, we propose steps towards evaluations that are rigorous in their methods and aim to be humble about their epistemic uncertainties.
In doing so, we expand on the call by Raji et al. to pay more attention not just to evaluation metric values but also to the quality and reliability of the measurements themselves, including sensitivity to external factors [135].
5.1 Minding the Gaps between Evaluation Goals and Research Practice
Documenting assumptions made during model evaluation is critical for transparency and enables more informed decisions.
If an assumption is difficult to avoid in practice, consider augmenting the evaluation with signals that may shed complementary light on questions of concern.
For example, even a handful of insightful comments from members of impacted communities can be an invaluable complement to evaluations using quantitative metrics.
We now consider specific mitigation strategies for each of the gaps in turn.
Minding Gap 1: Evaluate More than Consequences.
To reduce the gap introduced by the Consequentialism Assumption, evaluate the processes that led to the creation of the model, including how datasets were constructed [150].
We echo calls for more reflexivity around social and intentional factors around model development [116], more documentation of the complete lifecycle of model development [82, 167], and greater transparency around ML models and their datasets [13, 62, 117].
It may be appropriate to contemplate whether the model is aligned with the virtues the organization aspires to [165].
Consider the question of whether any ML model could be a morally appropriate solution in this application context, e g , whether it is appropriate to make decisions about one person on the basis of others’ behaviors [49].
Since reasoning about uncertain future states of the world is fraught with challenges [29], evaluations should consider indirect consequences and assess how the model upholds social obligations within the ecosystem.
This may involve processes such as assessments of human rights, social and ethical impact [109, 114], audits of whether the ML system upholds the organization’s declared values or principles [136], and/or assessments of the potential for privacy leakage (e.g., [30, 175]).
To address the gap introduced by the Assumption of Abstractability from Context, consider externalities such as energy consumption [75, 152], as well as resource requirements [51].
Note that when substituting one model for another—or for displaced human labor—system stability can itself be a desirable property independent of model accuracy (and perhaps counter to tech industry discourses of “disruption” [64]), and a range of metrics exist for comparing predictions with those of a legacy model [47].
The more attention paid to the specifics of the application context, the better; hence, metrics which assume no particular classification threshold, such as AUC, may provide limited signal for any single context.
Acknowledge the subjectivities inherent in many tasks [2].
An array of recent scholarship on subjectivity in ML has “embraced disagreement” through practices of explicitly modeling—in both the data model and the ML model—inter-subject variation in interpretations [7, 12, 45, 48, 55].
For the purposes of ML model evaluations, disaggregating labels on test data according to the cultural and socio-demographic standpoints of their annotators enables more nuanced disaggregated evaluation statistics [131].
For example, people may differ both in their preferences regarding model predictions ˆ𝑌 per se, as well as their preferences regarding model accuracy ˆ𝑌 = 𝑌 [17].6
As such—and independent of fairness considerations—evaluations should routinely pay attention to different parts of the input distribution, including disaggregating along social subgroups.
Special attention should be paid to the tail of the distribution and outliers during evaluation, as these may require further analysis to diagnose the potential for rare but unsafe impacts.
Input sensitivity testing can provide useful information about the sensitivity of the classifier to dimensions of input variation known to be of concern (e.g., gender in text [21, 66, 80, 180]).
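A minimal sketch of such a sensitivity test, with an invented scoring function and a hypothetical term-swap table standing in for a real model and perturbation suite; the point is the testing pattern (perturb, re-score, compare), not the toy scorer.

```python
# Toy stand-in for a deployed text scorer; a real model's scoring function would be probed instead.
def toy_score(text: str) -> float:
    trigger_words = {"stupid", "idiot"}                 # hypothetical lexicon
    bonus = 0.2 if "she" in text.split() else 0.0       # an undesirable identity-term sensitivity
    base = sum(w in trigger_words for w in text.lower().split()) / max(len(text.split()), 1)
    return min(base + bonus, 1.0)

SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him"}  # hypothetical perturbation table

def swap_terms(text: str) -> str:
    return " ".join(SWAPS.get(w, w) for w in text.split())

for sentence in ["he is a great engineer", "she is a great engineer"]:
    delta = toy_score(swap_terms(sentence)) - toy_score(sentence)
    print(f"{sentence!r}: score change under term swap = {delta:+.2f}")  # nonzero deltas flag sensitivity
```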
Resist the temptation to reduce a model’s utility to a single scalar value, either for stack ranking [51] or to simplify the cognitive load on decision makers.
Acknowledge qualitative impacts that are not addressed by metrics (e.g., harms to application users caused by supplanting socially meaningful human interactions), and rigorously assess the validity of attempts to measure social or emotional harms.
Be conservative in aggregations: consider plotting data rather than reporting summary statistics (cf. Anscombe’s quartet); do not aggregate unlike quantities; report multiple estimates of central tendency and variation; and don’t assume that all users of an application will have equal benefits (or harms) from system outcomes.
Consider applying aggregation and partial ranking techniques from the fair division literature to ML models, including techniques that give greater weight to those with the worst outcomes (e.g., in the extreme case, “Maximin”) [50].
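The sketch below contrasts a plain mean with a Maximin (worst-group) ordering over invented per-group accuracies; the numbers are hypothetical and chosen so that the two aggregations disagree about which model to prefer.

```python
import numpy as np

groups = ["group_1", "group_2", "group_3"]
model_a = {"group_1": 0.99, "group_2": 0.98, "group_3": 0.72}   # hypothetical per-group accuracies
model_b = {"group_1": 0.90, "group_2": 0.89, "group_3": 0.86}

for name, scores in [("A", model_a), ("B", model_b)]:
    vals = np.array([scores[g] for g in groups])
    print(f"model {name}: mean={vals.mean():.3f}, maximin (worst group)={vals.min():.3f}")
# Averaging prefers model A; a maximin ordering prefers model B, which treats its worst-off group best.
```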
Minding Gap 7: Respect Differences Between Failures.
If the harms of false positives and false negatives are incommensurable, report them separately.
If commensurable, weight each appropriately.
For multiclass classifiers, this approach generalizes to a classification cost matrix [163] and, more generally, to reporting the confusion matrix before costs are assigned; for regression tasks, report metrics such as MSE disaggregated by buckets of 𝑌.
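Both suggestions are straightforward to operationalize; the sketch below uses an assumed cost matrix, invented labels, and arbitrary bucket boundaries purely for illustration.

```python
import numpy as np

# Multiclass: hypothetical cost matrix C[i, j] = cost of predicting class j when the truth is class i.
# Confusing class 2 (assumed to be a severe case) with class 0 is made far costlier than other errors.
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])

y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 0, 2, 1, 1, 2])

confusion = np.zeros((3, 3))
np.add.at(confusion, (y_true, y_pred), 1)      # report the confusion matrix before assigning costs
print("confusion matrix:\n", confusion)
print("average cost:", float(np.sum(confusion * C) / len(y_true)))

# Regression: MSE disaggregated by buckets of the reference value Y.
y = np.array([0.5, 1.2, 3.8, 4.1, 9.0, 9.5])
pred = np.array([0.7, 1.0, 3.0, 4.5, 6.0, 7.0])
buckets = np.digitize(y, bins=[2.0, 5.0])      # low / mid / high buckets of Y
for b in np.unique(buckets):
    mask = buckets == b
    print(f"bucket {b}: MSE = {float(np.mean((pred[mask] - y[mask]) ** 2)):.3f}")
```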
For transparency, do not assume it is obvious to others which datasets are used in training and evaluation; instead, be explicit about the provenance, distribution, and known biases of the datasets in use [6].
Consider Bayesian approaches to dealing with uncertainty about data distributions [90, 98, 115], especially when sample sizes are small or prior work has revealed systematic biases.
For example, an evaluation which uses limited data in a novel domain (or in an under-studied language) to investigate gender biases in pronoun resolution should be tentative in drawing strong positive conclusions about “fairness” due to abundant evidence of gender biases in English pronoun resolution models (e.g., [171]).
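One hedged way to express that tentativeness is a Beta-Binomial treatment of the underlying rate (a standard Bayesian device, not one prescribed by the works cited); with an assumed sample of 20 items, the credible interval stays wide enough to discourage strong conclusions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed small evaluation sample in an under-studied setting: 18 of 20 items handled acceptably.
successes, n = 18, 20

# Beta-Binomial posterior over the underlying rate, with a uniform Beta(1, 1) prior.
posterior = rng.beta(1 + successes, 1 + n - successes, size=100_000)
lo, hi = np.percentile(posterior, [2.5, 97.5])

print(f"point estimate: {successes / n:.2f}")
print(f"95% credible interval: [{lo:.2f}, {hi:.2f}]")   # wide: weak evidence for strong "fairness" claims
```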
5.2 Alternate Model Evaluation Methodologies
More radical excursions from the disciplinary paradigm are often worth considering, especially in scenarios with high stakes or high uncertainty.
Evaluation Remits. In 1995, Sparck Jones and Galliers called for a careful approach to NLP evaluation that is broadly applicable to ML model evaluations (see Appendix D) [91].
Their approach involves a top-down examination of the context and goal of the evaluation before the evaluation design even begins, and their call for careful documentation of the evaluation “remit”—i.e., official responsibilities—is in line with more recent work calling for stakeholder transparency.
6 Note that in many real-world applications the “ground truth” variable 𝑌 may be a convenient counterfactual fiction, since the system’s actions on the basis of the prediction ˆ𝑌 may inhibit 𝑌 from being realised—for example, a finance ML model may predict that a potential customer would default on a loan if given one, and hence the system the model is deployed in may prevent the customer getting a loan in the first place.
They advocate for establishing whose perspectives are adopted in the evaluation and whose interests prompted it.
Appendix D sketches how Sparck Jones and Galliers’ framework could be adopted for ML model evaluations.
Active Testing. Active Testing aims to iteratively choose new items that are most informative in addressing the goals of the evaluation [69, 95] (cf. its cousin Active Learning, which selects items that are informative for the learner).
Active Testing provides a better estimate of model performance than using the same number of test instances sampled I.I.D. Exploring Active Testing in pursuit of fairness testing goals seems a promising direction for future research.
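The sketch below is a deliberately simplified caricature of the idea (the cited methods use more careful acquisition and debiasing): it spends an assumed labeling budget preferentially on items the model is least certain about, then reweights to estimate the overall error rate; all scores and labels are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic pool of unlabeled test items with model scores p_hat = P(y=1 | x).
n = 10_000
p_hat = rng.random(n)
y = (rng.random(n) < p_hat).astype(int)          # hidden labels, revealed only when an item is acquired

budget = 200
# Caricature acquisition rule: label the items the model is least certain about,
# then correct for non-uniform sampling with (self-normalized) importance weights.
uncertainty = 1.0 - 2.0 * np.abs(p_hat - 0.5)
q = uncertainty / uncertainty.sum()
picked = rng.choice(n, size=budget, replace=False, p=q)

losses = (np.round(p_hat[picked]) != y[picked]).astype(float)
weights = (1.0 / n) / q[picked]
active_estimate = float(np.sum(weights * losses) / np.sum(weights))

iid = rng.choice(n, size=budget, replace=False)
iid_estimate = float(np.mean(np.round(p_hat[iid]) != y[iid]))
true_error = float(np.mean(np.round(p_hat) != y))

print(f"true error {true_error:.3f} | active-style estimate {active_estimate:.3f} | I.I.D. estimate {iid_estimate:.3f}")
```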
One cautious and conservative approach—especially in the face of great uncertainty—is to simulate “adversaries” trying to provoke harmful outcomes from the system.
Borrowing adversarial techniques from security testing and privacy testing, adversarial testing of models requires due diligence to trigger the most harmful model predictions, using either manually chosen or algorithmically generated test instances [52, 144, 176, 178].
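A toy sketch of that due-diligence pattern follows, with an invented stand-in model and two hypothetical perturbation operators; a real adversarial test suite would probe the deployed model with far richer, harm-targeted perturbations.

```python
# Invented stand-in for a deployed scoring model; adversarial testing would probe the real model instead.
def toy_model_score(text: str) -> float:
    return (sum(ord(c) for c in text) % 100) / 100.0     # opaque, arbitrary behavior

def perturbations(text: str):
    """Hypothetical perturbation operators: drop a word, or swap adjacent characters."""
    words = text.split()
    for i in range(len(words)):
        yield " ".join(words[:i] + words[i + 1:])
    for i in range(len(text) - 1):
        yield text[:i] + text[i + 1] + text[i] + text[i + 2:]

seed = "the service was fine"
# Surface the perturbed inputs that provoke the most extreme (here, highest) scores.
worst = sorted(perturbations(seed), key=toy_model_score, reverse=True)[:3]
for t in worst:
    print(f"{toy_model_score(t):.2f}  {t!r}")
```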
Multidimensional Comparisons. When comparing candidate models, avoid the “Leaderboardism Trap” of believing that a total ordering of candidates is possible.
A multidimensional and nuanced evaluation may provide at best a partial ordering of candidate models, and it may require careful and accountable judgement and qualitative considerations to decide among them.
The Fair Division literature on Social Welfare Orderings may be a promising direction for developing evaluation frameworks that prioritize “egalitarian” considerations, in which greater weighting is given to those who are worst impacted by a model [50].
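A sketch of such a partial ordering via Pareto dominance over hypothetical metric values (all numbers invented); incomparable pairs are exactly where judgement, rather than a leaderboard, has to decide.

```python
# Hypothetical multi-metric results for candidate models (higher is better on every axis).
candidates = {
    "model_A": {"accuracy": 0.91, "worst_group_accuracy": 0.70, "robustness": 0.80},
    "model_B": {"accuracy": 0.89, "worst_group_accuracy": 0.86, "robustness": 0.84},
    "model_C": {"accuracy": 0.88, "worst_group_accuracy": 0.85, "robustness": 0.82},
}

def dominates(a: dict, b: dict) -> bool:
    """a Pareto-dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

for x in candidates:
    for y in candidates:
        if x != y and dominates(candidates[x], candidates[y]):
            print(f"{x} dominates {y}")
# Only B > C is resolved; A vs. B and A vs. C remain incomparable under this partial order.
```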
5.3 Evaluation-driven ML Methodologies
In this section, we follow Rostamzadeh et al. in drawing inspiration from test-driven practices, such as those of software development [143].
Traditional software testing involves significant time, resources, and effort [72]; even moderatesized software projects spend hundreds of person-hours writing test cases, implementing them, and meticulously documenting the test results.
In fact, software testing is sometimes considered an art [120] requiring its own technical and non-technical skills [112, 148], and entire career paths are built around testing [42].
Test-driven development, often associated with agile software engineering frameworks, integrates testing considerations in all parts of the development process [8, 65].
These processes rely on a deep understanding of software requirements and user behavior to anticipate failure modes during deployment and to expand the test suite.
(In contrast, ML testing is often relegated to a small portion of the ML development cycle, and predominantly focuses on a static snapshot of data to provide performance guarantees.)
These software testing methodologies provide a model for ML testing.
First, the model suggests anticipating, planning for, and integrating testing in all stages of the development cycle, including research problem ideation, the setting of objectives, and system implementation.
Second, build a practice around bringing diverse perspectives into designing the test suite.
Additionally, consider participatory approaches (e.g., [111]) to ensure that the test suite accounts for societal contexts and embedded values within which the ML system will be deployed.
In contrast, the paradigm of ML evaluation methodologies is that the ML practitioner should not inspect the test data, lest their observations result in design decisions that produce an overfitted model.
In contrast, in real-world applications model developers might benefit from a healthy model ecosystem, for example when they are members of that ecosystem.
(However, when developers come from a different society altogether there may be disinterest or disalignment [145].)
Software testing produces artifacts such as execution traces, and test coverage information [72].
Developing practices for routinely sharing testing artifacts with stakeholders provides for more robust scrutiny and diagnosis of harmful error cases [136].
In being flexible enough to adapt to the information needs of stakeholders, software testing artifacts can be considered a form of boundary object [158].
Within an ML context, these considerations point towards adopting ML transparency mechanisms incorporating comprehensive evaluations, such as model cards [117].
Finally, as for any high-stakes system—software, ML or otherwise—evaluation documentation constitutes an important part of the chain of auditable artifacts required for robust accountability and governance practices [136].
6 CONCLUSIONS
In this paper, we compared the evaluation practices in the ML research community to the ideal information needs of those who use models in real-world applications.
The observed disconnect between the two is likely due to differences in motivations and goals, and also pressures to demonstrate “state-of-the-art” performance on shared tasks, metrics and leaderboards [51, 93, 161], as well as a focus on the learner as the object upon which the researcher hopes to shed light.
One limitation of our methodology is its reliance on published papers, and we encourage more human subjects research in the future, in a similar vein to, e.g., [77, 108, 147].
We identified a range of evaluation gaps that risk being overlooked if the ML research community’s evaluation practices are uncritically adopted for applications, and we identified six assumptions that would have to be valid for these gaps to be safely ignored.
The assumptions range from a broad focus on consequentialism to technical concerns regarding distributions of evaluation data.
By presenting these assumptions as a coherent framework, we provide not just a set of mitigations for each evaluation gap, but also demonstrate the relationships between these mitigations.
We show how in the naive case these assumptions chain together, leading to the grossest assumption that calculating model accuracy on data I.I.D. with the training data can be a reliable signal for real-world applications.
We contrast the practices of ML model evaluation with those of the mature engineering practices of software testing to draw out lessons for non-I.I.D. testing under a variety of stress conditions and failure severities.
One limitation of our analysis is that we are generally domain-agnostic, and we hope to stimulate investigations of assumptions and gaps for specific application domains.
By naming each assumption we identify and exploring its technical and sociological consequences, we hope to encourage more robust interdisciplinary debate and, ultimately, to nudge model evaluation practice away from abundant opaque unknowns.
[20] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine
[20]リシ・ボンマシーニ、ドリュー・ア・ハドソン、エーサン・アデリ、ラス・アルトマン、シムラン・アローラ、シドニー・フォン・アルクス、マイケル・スバーンスタイン、ジャネット・ボーク、アントワーヌ 訳抜け防止モード: [20 ]梨文政尼、ドリュー・ア・ハドソン、エサン・アデリ Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein Jeannette Bohg, Antoine
0.83
Bosselut, Emma Brunskill, et al 2021.
ボッセルート、エマ・ブランスキル、アル・2021。
0.47
On the opportunities and risks of foundation models.
基礎モデルの機会とリスクについてです
0.70
arXiv preprint arXiv:2108.07258 (2021).
arxiv プレプリント arxiv:2108.07258 (2021)
0.44
[21] Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman.
2017. NeuralPower: Predict and deploy energy-efficient convolutional
2017. NeuralPower:エネルギー効率の良い畳み込み予測と展開
0.59
neural networks.
ニューラルネットワーク。
0.65
In Asian Conference on Machine Learning.
アジアの機械学習に関する会議です
0.80
PMLR, 622–637.
PMLR 622-637。
0.81
[29] Dallas Card and Noah A Smith.
29]ダラスカードとノア・エイ・スミス
0.51
2020. On Consequentialism and Fairness.
2020. 連続主義と公正性。
0.58
Frontiers in Artificial Intelligence 3 (2020), 34.
人工知能のフロンティア(2020年)、34。
0.54
[30] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al 2021.
30]nicholas carlini、florian tramer、eric wallace、matthew jagielski、ariel herbert-voss、katherine lee、adam roberts、tom brown、dawn song、ulfar erlingsson、そしてal 2021。 訳抜け防止モード: ニコラス・カルリーニ、フローリアン・トレーマー、エリック・ウォレス。 マシュー・ジャゲルスキー、アリエル・ハーバート - ヴォス、キャサリン・リー、アダム・ロバーツ。 Tom Brown, Dawn Song, Ulfar Erlingsson, et al 2021
0.70
Extracting training data from large language models.
大規模言語モデルからトレーニングデータを抽出する。
0.76
In 30th USENIX Security Symposium (USENIX Security 21).
第30回USENIXセキュリティシンポジウム(USENIX Security 21)に参加。
0.84
2633–2650.
2633–2650.
0.35
[31] Brandon Carter, Siddhartha Jain, Jonas W Mueller, and David Gifford.
[34] Mayee Chen, Karan Goel, Nimit S Sohoni, Fait Poms, Kayvon Fatahalian, and Christopher Ré.
[34] Mayee Chen, Karan Goel, Nimit S Sohoni, Fait Poms, Kayvon Fatahalian, and Christopher Ré. 2021. Mandoline: Model Evaluation under Distribution Shift. In International Conference on Machine Learning.
Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163.
[38] Sam Corbett-Davies and Sharad Goel. 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023 (2018).
[39] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 797–806.
[40] Kate Crawford and Vladan Joler. 2018. Anatomy of an AI System. (Accessed January, 2022).
[41] Kate Crawford and Trevor Paglen. 2021. Excavating AI: The politics of images in machine learning training sets. AI & SOCIETY (2021), 1–12.
[42] Sean Cunningham, Jemil Gambo, Aidan Lawless, Declan Moore, Murat Yilmaz, Paul M Clarke, and Rory V O'Connor. 2019. Software testing: a changing career. In European Conference on Software Process Improvement. Springer, 731–742.
[43] Emma Dahlin. 2021. Mind the gap! On the future of AI research. Humanities and Social Sciences Communications 8, 1 (2021), 1–4.
[44] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. 2020. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395 (2020).
[45] Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10 (2022), 92–110.
[46] Harm De Vries, Dzmitry Bahdanau, and Christopher Manning. 2020. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435 (2020).
[47] Leon Derczynski. 2016. Complementarity, F-score, and NLP Evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 261–266.
[48] Mark Díaz and Nicholas Diakopoulos. 2019. Whose walkability?: Challenges in algorithmically measuring subjective experience. Proceedings of the
[49] Laurel Eckhouse, Kristian Lum, Cynthia Conti-Cook, and Julie Ciccolini.
2021. Towards Benchmarking the Utility of Explanations for Model Debugging. In Proceedings of the First Workshop on Trustworthy Natural Language Processing. 68–73.
[84] IEEE. 2019. The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. "Classical Ethics in A/IS". In Ethically Aligned Design: A Vision for Prioritizing Human Well-being with Autonomous and Intelligent Systems, First Edition. 36–67.
[85] Abigail Z Jacobs, Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. The meaning and measurement of bias: lessons from natural language processing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 706–706.
[86] Abigail Z Jacobs and Hanna Wallach. 2021. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 375–385.
[87] Yasamin Jafarian and Hyun Soo Park. 2021. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[94] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. 2020. WILDS: A Benchmark of in-the-Wild Distribution Shifts. CoRR abs/2012.07421 (2020). https://arxiv.org/abs/2012.07421
[95] Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth.
transfer learning. arXiv preprint arXiv:1806.07528 (2018).
[100] George Lakoff and Mark Johnson. 2008. Metaphors we live by. University of Chicago Press.
[101] Guillaume Lecué and Matthieu Lerasle. 2020. Robust machine learning by median-of-means: theory and practice. The Annals of Statistics 48, 2 (2020), 906–931.
[102] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. 2016. TGIF: A new dataset and benchmark on animated GIF description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4641–4650.
[103] Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. 2021. Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[104] Chien-Hsin Lin, Hsin-Yu Shih, and Peter J Sher. 2007. Integrating technology readiness into technology acceptance: The TRAM model. Psychology & Marketing 24, 7 (2007), 641–657.
[105] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81.
[106] Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. 2018. Delayed impact of fair machine learning. In International Conference on Machine Learning. PMLR, 3150–3158.
[107] Chi-kiu Lo and Dekai Wu. 2010. Evaluating Machine Translation Utility via Semantic Role Labels. In LREC. Citeseer.
[108] Michael Madaio, Lisa Egede, Hariharan Subramonyam, Jennifer Wortman Vaughan, and Hanna Wallach. 2022. Assessing the Fairness of AI Systems: AI Practitioners' Processes, Challenges, and Needs for Support. Proceedings of the ACM on Human-Computer Interaction 6, CSCW1 (2022), 1–26.
[109] Alessandro Mantelero. 2018. AI and Big Data: A blueprint for a human rights, social and ethical impact assessment. Computer Law & Security Review 34, 4 (2018), 754–772.
[110] Marrkula Center. 2019. Approaches to Ethical Decision-making. https://www.scu.edu/ethics/ethics-resources/ethical-decision-making/
[111] Donald Martin, Jr., Vinodkumar Prabhakaran, Jill Kuhlberg, Andrew Smart, and William S. Isaac. 2020. Extending the Machine Learning Abstraction Boundary: A Complex Systems Approach to Incorporate Societal Context. arXiv:2006.09663 [cs.CY]
[112] Gerardo Matturro. 2013. Soft skills in software engineering: A study of its demand by software companies in Uruguay.
133–138.
[132] Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation Sensitivity Analysis to Detect Unintended Model Biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
1997. Analysis and visualization of classifier performance with nonuniform class and cost distributions. In Proceedings of AAAI-97 Workshop on AI Approaches to Fraud Detection & Risk Management. 57–63.
[134] James Pustejovsky. 1998. The generative lexicon. MIT Press.
[135] Inioluwa Deborah Raji, Emily Denton, Emily M Bender, Alex Hanna, and Amandalynne Paullada. 2021. AI and the Everything in the Whole Wide World Benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[136] Inioluwa Deborah Raji, Andrew Smart, Rebecca N White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes.
[141] Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jia, and Jordan Boyd-Graber. 2021. Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
development. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–37.
[151] David Schlangen. 2021. Targeting the Benchmark: On Methodology in Current Natural Language Processing Research. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 670–674.
[152] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni.
2021. Generalizing to Unseen Domains: A Survey on Domain Generalization. In Proceedings of IJCAI 2021.
[171] Kellie Webster, Marta R Costa-jussà, Christian Hardmeier, and Will Radford. 2019. Gendered ambiguous pronoun (GAP) shared task at the Gender Bias in NLP Workshop 2019. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. 1–7.
[172] Sarah Myers West, Meredith Whittaker, and Kate Crawford. 2019. Discriminating systems. AI Now (2019).
[173] Jim Winkens, Rudy Bunel, Abhijit Guha Roy, Robert Stanforth, Vivek Natarajan, Joseph R Ledsam, Patricia MacWilliams, Pushmeet Kohli, Alan Karthikesalingam, Simon Kohl, et al. 2020. Contrastive Training for Improved Out-of-Distribution Detection. arXiv e-prints (2020), arXiv–2007.
[174] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. 2021. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11307–11317.
[175] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha.
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations. 363–371.
[177] Jie M Zhang, Mark Harman, Lei Ma, and Yang Liu. 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering (2020).
[178] Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. 2020. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 3 (2020), 1–41.
[179] Benjamin Zi Hao Zhao, Mohamed Ali Kaafar, and Nicolas Kourtellis. 2020. Not one but many tradeoffs: Privacy vs. utility in differentially private machine learning. In Proceedings of the 2020 ACM SIGSAC Conference on Cloud Computing Security Workshop.
2018. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2.
APPENDIX A: METRICS IN ML MODEL EVALUATIONS
Here we give definitions and categorizations of some of the metrics reported in the study in Section 3. In practice, there was a long tail, since many metrics were used in only a single paper. Here we include only the metrics that were most frequently observed in our study.
Metric | Example Task(s) | Metric category | Definition
Accuracy | Classification | Accuracy | A metric that penalizes system predictions that do not agree with the reference data ((TP+TN)/(TP+TN+FP+FN)).
AUC | Classification | AUC | The area under the curve parameterized by classification threshold t, typically with the y-axis representing recall and the x-axis representing false positive rate (FP/(FP+TN)).
Bleu | Machine translation | Precision | A form of "n-gram precision," originally designed for machine translation but also sometimes used for other text generation tasks, which measures whether sequences of words in the system output are also present in the reference texts [125].
Dice | Image segmentation | Overlap | 2TP/(2TP+FP+FN).
Error rate | Classification | Accuracy | The proportion of system predictions that disagree with the reference data ((FP+FN)/(TP+TN+FP+FN)).
F (or F1) | Text classification | Overlap | A weighted harmonic mean of recall and precision, (1+β²)PR/(β²P+R); F1 corresponds to β = 1.
F0.5 | Text classification | Overlap | The weighted harmonic mean above with β = 0.5, giving greater weight to precision than to recall.
Hausdorff distance | Medical image segmentation | Distance | A measure of distance between two sets in a metric space. Two sets have a low Hausdorff distance if every point in each set is close to a point in the other set.
IoU | Image segmentation | Overlap | TP/(TP+FP+FN). Equivalent to Jaccard.
Matthew's Correlation Coefficient | — | Correlation | Has been argued to address shortcomings in F1's asymmetry with respect to classes ((TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN))).
Mean absolute error | Regression | Distance | (1/N) Σᵢ |ŷᵢ − yᵢ|.
Mean Average Precision (MAP) | Information retrieval (NLP) | AUC | In information retrieval, the average over information needs of the average precision of the documents retrieved for that need.
Mean average precision (mAP) | Object detection (CV) | AUC | The area under the Precision-Recall tradeoff curve, averaged over multiple IoU (intersection over union) threshold values, then averaged across all categories (https://cocodataset.org/#detection-eval).
Mean reciprocal rank | Information retrieval | Other | A measure for evaluating processes that produce an ordered list of possible responses: the average of the inverse rank of the first relevant item retrieved.
MSE | Image decomposition | Distance | Mean squared error (MSE) measures the average of the squared difference between estimated and actual values.
Normalized Discounted Cumulative Gain (NDCG) | Recommendation or ranking tasks | Other | A measure of ranking quality which takes into account the usefulness of items based on their ranking in the result list.
Pearson's r | Quality estimation | Correlation | A measure of linear correlation between two sets of data.
Perplexity | Language modeling | Perplexity | An information-theoretic metric (measured in bits per unit, e.g., bits per character or bits per sentence) often used for language models, inversely related to the probability assigned to the test data by the model. Closely related to the cross-entropy between the model and the test data; can be thought of as how efficiently the language model encodes the test data.
Precision | Classification | Precision | A metric that penalizes the system for predicting a class (if the class is unspecified, by default the "positive" class) when the reference data did not belong to this class (TP/(TP+FP)).
PSNR | Super resolution | Distance | Peak Signal-to-Noise Ratio (PSNR) is the ratio between the maximum possible value of a signal and the power of the distorting noise (mean squared error) that impacts the quality of its representation.
Recall | — | — | Also known as "sensitivity," this metric penalizes the system for failing to predict a class (if the class is unspecified, by default the "positive" class) when the reference data did belong to this class (TP/(TP+FN)); a.k.a. true positive rate.
RMSE | — | — | Root Mean Square Error (RMSE) is the square root of the MSE.
Rouge | — | — | A form of "n-gram recall," originally designed for text summarization but also sometimes used for other text generation tasks, which measures whether sequences of words in the reference texts are also present in the system output [105].
— | — | Correlation | A measure of monotonic association between two variables; less restrictive than linear correlation.
— | — | — | Like Precision, this metric penalizes the system for failing to predict a class when the reference data did belong to this class; unlike Precision, it rewards true negatives rather than true positives (TN/(TN+FN)).
SSIM | — | — | The Structural Similarity Method (SSIM) is a perception-based method for measuring the similarity between two images. The formula is based on comparison measurements of luminance, contrast, and structure.
Top-n | — | — | A metric for systems that return ranked lists, which calculates accuracy over the top n entries in each list.
Word error rate | — | Accuracy | The inverse of word accuracy: 1 − word accuracy (which is not technically always in [0, 1] due to the way word accuracy is defined, but which is categorized as "Accuracy" here because both insertions and deletions are penalized).
Table 5. Definitions and categorizations of metrics reported in Section 3. TP, TN, FP, and FN indicate the number of true positives, true negatives, false positives, and false negatives, respectively. y and ŷ represent actual values and values predicted by the system, respectively; P and R denote precision and recall. Cells marked "—" were not recoverable here.
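To make the confusion-matrix-based entries in Table 5 concrete, we include below a minimal, illustrative Python sketch. It is not taken from any of the surveyed evaluations; the function names and the toy counts are our own, and only the formulas stated in the table are assumed.

```python
# Illustrative sketch of several Table 5 metrics, computed from confusion-matrix
# counts (TP, TN, FP, FN) and from paired real-valued predictions.
# Function names and the toy counts below are hypothetical.
from math import sqrt


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):  # a.k.a. sensitivity / true positive rate
    return tp / (tp + fn)


def f_beta(tp, fp, fn, beta=1.0):
    # Weighted harmonic mean of precision and recall; beta=1 gives F1,
    # beta=0.5 gives F0.5 (greater weight on precision).
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)


def iou(tp, fp, fn):  # intersection over union, equivalent to Jaccard
    return tp / (tp + fp + fn)


def dice(tp, fp, fn):  # equals F1 computed on the positive class
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, tn, fp, fn):  # Matthew's Correlation Coefficient
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    tp, tn, fp, fn = 40, 45, 5, 10  # a toy confusion matrix
    print(accuracy(tp, tn, fp, fn), f_beta(tp, fp, fn, beta=0.5),
          iou(tp, fp, fn), dice(tp, fp, fn), mcc(tp, tn, fp, fn))
```

Note that Dice and IoU are monotonically related (Dice = 2·IoU/(1+IoU)), which is consistent with Table 5 placing both in the "Overlap" category.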
A manually compiled resource (in NLP, often a word-based resource such as a lexicon or thesaurus), against which knowledge acquired from a dataset is compared.
Reference outputs (typically obtained prior to building the system) which a generative system is trying to reproduce, typically obtained from humans (e.g., manual translations of input sentences in the case of evaluations using Bleu for machine translation tasks).
Test data that has the same form as the training data but is drawn from a different distribution (e.g., in the case of NLP, training on labeled newspaper data and testing on labeled Wikipedia data).
Table 6. Types of datasets used in ML model evaluations.
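As a brief illustration of the last dataset type in Table 6, the sketch below (ours, not from the surveyed papers; the model and dataset names are hypothetical placeholders) evaluates one fixed classifier on an in-distribution test set and on a distribution-shifted test set drawn from a different source, so the gap between the two scores can be inspected.

```python
# Illustrative sketch (assumptions: `model` is a callable classifier and each
# test set is a list of (input, label) pairs; all names are hypothetical).

def evaluate_accuracy(model, examples):
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


def distribution_shift_report(model, in_distribution_test, shifted_test):
    in_dist = evaluate_accuracy(model, in_distribution_test)
    shifted = evaluate_accuracy(model, shifted_test)
    return {
        "in_distribution_accuracy": in_dist,
        "shifted_accuracy": shifted,
        "degradation": in_dist - shifted,
    }

# e.g., distribution_shift_report(model, newswire_test, wikipedia_test) for the
# newspaper-to-Wikipedia example given in Table 6.
```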
APPENDIX C: EXAMPLE OF ASSUMPTIONS AND GAPS FOR A HYPOTHETICAL APPLICATION
Suppose we are evaluating a hypothetical image classification model for use in an application for assisting blind people in identifying groceries in their pantries.

What is the perspective being adopted — task/financial/administrative/scientific/...
Whose interests prompted the evaluation — developer/funder/...
Who are the consumers of the model evaluation results — manager/user/researcher/...
Constitution — what is the structure of the model? What was the training data?
To determine: factors that will be tested (environment variables, 'system' parameters); evaluation criteria (metrics/measures, methods)
Evaluation data — what type, status and nature?
Evaluation procedure

Table 7. A sketch of how Karen Sparck Jones and Julia Galliers' 1995 NLP evaluation framework questionnaire [91] can be adapted for the evaluation of ML models. The output of the remit and the design is a strategy for conducting the model evaluation. For a related but simpler framework based on model requirements analysis, see also the "7-step Recipe" for NLP system evaluation (https://www.issco.unige.ch/en/research/projects/eagles/ewg99/7steps.html) developed by the EAGLES Evaluation Working Group in 1999, which considers whether different parties have a shared understanding of the evaluation's purpose.
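As a sketch of how the questionnaire in Table 7 might be recorded in practice, the following Python data structures (our illustration, not part of the Sparck Jones and Galliers framework; all field names and example values are hypothetical) capture the remit and design items as a structured evaluation strategy for the hypothetical grocery-identification application.

```python
# Illustrative sketch (ours): recording the Table 7 remit/design questionnaire
# as structured data. All field names and example values are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluationRemit:
    perspective: str                 # task / financial / administrative / scientific / ...
    interested_parties: List[str]    # whose interests prompted the evaluation
    result_consumers: List[str]      # managers / users / researchers / ...


@dataclass
class EvaluationDesign:
    model_constitution: str          # structure of the model and its training data
    factors_to_test: List[str]       # environment variables, 'system' parameters
    criteria: List[str]              # metrics / measures / methods
    evaluation_data: str             # type, status and nature of the data
    procedure: str


@dataclass
class EvaluationStrategy:              # the output of the remit and the design
    remit: EvaluationRemit
    design: EvaluationDesign


pantry_app_strategy = EvaluationStrategy(
    remit=EvaluationRemit(
        perspective="task",
        interested_parties=["application developer"],
        result_consumers=["product manager", "blind users"],
    ),
    design=EvaluationDesign(
        model_constitution="image classifier; training data to be documented",
        factors_to_test=["lighting conditions", "camera angle"],
        criteria=["top-1 accuracy", "per-product error analysis"],
        evaluation_data="photographs of pantry items taken by blind users",
        procedure="held-out test set plus in-situ user study",
    ),
)
```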