Fugu-MT Paper Translation (Abstract): NERIF: GPT-4V for Automatic Scoring of Drawn Models

Paper overview: NERIF: GPT-4V for Automatic Scoring of Drawn Models

arxiv url: http://arxiv.org/abs/2311.12990v1
Date: Tue, 21 Nov 2023 20:52:04 GMT
Status: Translation complete
Last updated in system: 2023-11-23 17:11:49.293334
Title: NERIF: GPT-4V for Automatic Scoring of Drawn Models
Title (reference translation): NERIF: GPT-4V for Automatic Scoring of Drawn Models
Authors: Gyeong-Geon Lee and Xiaoming Zhai
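Abstract summary: The recently released GPT-4V offers a unique opportunity to advance scientific modeling practices. We developed a method that uses instructional notes and rubrics to prompt GPT-4V to score students' drawn models. GPT-4V scores were compared with human experts' scores to calculate scoring accuracy.
Reference score (independently computed attention score): 0.6278186810520364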
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Scoring student-drawn models is time-consuming. The recently released GPT-4V provides a unique opportunity to advance scientific modeling practices by leveraging its powerful image processing capability. To test this ability specifically for automatic scoring, we developed a method, NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning), that employs instructional notes and rubrics to prompt GPT-4V to score students' drawn models of science phenomena. We randomly selected a set of balanced data (N = 900) that includes student-drawn models for six modeling assessment tasks. Each model received a score from GPT-4V at one of three levels, 'Beginning,' 'Developing,' or 'Proficient,' according to scoring rubrics. GPT-4V scores were compared with human experts' scores to calculate scoring accuracy. Results show that GPT-4V's average scoring accuracy was mean = .51, SD = .037. Specifically, average scoring accuracy was .64 for the 'Beginning' class, .62 for the 'Developing' class, and .26 for the 'Proficient' class, indicating that more proficient models are more challenging to score. A further qualitative study reveals how GPT-4V retrieves information from image input, including the problem context, example evaluations provided by human coders, and students' drawn models. We also uncovered how GPT-4V captures the characteristics of student-drawn models and narrates them in natural language. Finally, we demonstrated how GPT-4V assigns scores to student-drawn models according to the given scoring rubric and instructional notes. Our findings suggest that NERIF is an effective approach for employing GPT-4V to score drawn models. Even though there is room for GPT-4V to improve its scoring accuracy, some mis-assigned scores seemed interpretable to experts. The results of this study show that utilizing GPT-4V for automatic scoring of student-drawn models is promising.
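Abstract (reference translation): Scoring student-drawn models takes time. The recently released GPT-4V offers a unique opportunity to advance scientific modeling practices by leveraging its powerful image processing capability. To test this capability specifically for automatic scoring, we developed NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning), which uses instructional notes and rubrics to prompt GPT-4V. We randomly selected balanced data (N = 900) containing student-drawn models for six modeling assessment tasks. GPT-4V scored each model at one of three levels: 'Beginning,' 'Developing,' or 'Proficient.' GPT-4V scores were compared with human experts' scores to calculate scoring accuracy. GPT-4V's average scoring accuracy was mean = .51, SD = .037. Specifically, average scoring accuracy was .64 for the 'Beginning' class, .62 for the 'Developing' class, and .26 for the 'Proficient' class, indicating that more proficient models are harder to score. A further qualitative study reveals how GPT-4V retrieves information from image input, including the problem context, example evaluations by human coders, and students' drawn models. We also clarified how GPT-4V captures the characteristics of student-drawn models and narrates them in natural language. Finally, we demonstrated how GPT-4V assigns scores to student-drawn models according to the given scoring rubric and instructional notes. The results suggest that NERIF is an effective approach for scoring drawn models with GPT-4V. Although GPT-4V has room to improve its accuracy, some mis-assigned scores seemed interpretable to experts. The results of this study indicate that automatic scoring of student-drawn models with GPT-4V is promising.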
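The abstract reports an overall scoring accuracy and a per-class accuracy obtained by comparing GPT-4V scores with human expert scores. The sketch below shows one plausible way to compute both; it is not the authors' code. The function name `scoring_accuracy` and the toy labels are hypothetical, and per-class accuracy is assumed to mean the share of human-labeled items in a class that GPT-4V scored identically. The last line checks one reading of the reported figures: if "balanced" means equally sized classes (an assumption), the unweighted mean of .64, .62, and .26 should round to the reported .51.

```python
from collections import Counter

LEVELS = ("Beginning", "Developing", "Proficient")


def scoring_accuracy(human, model):
    """Return (overall accuracy, per-level accuracy) for paired labels.

    Per-level accuracy is computed relative to the human label, i.e. the
    fraction of items a human scored at that level that the model matched.
    """
    if len(human) != len(model):
        raise ValueError("human and model label lists must have equal length")
    if not human:
        raise ValueError("at least one scored item is required")
    totals = Counter(human)
    hits = Counter(h for h, m in zip(human, model) if h == m)
    per_level = {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}
    overall = sum(hits.values()) / len(human)
    return overall, per_level


# Hypothetical toy labels for illustration only, not data from the paper.
human = ["Beginning", "Beginning", "Developing",
         "Developing", "Proficient", "Proficient"]
model = ["Beginning", "Developing", "Developing",
         "Developing", "Beginning", "Proficient"]
overall, per_level = scoring_accuracy(human, model)

# With equally sized classes, overall accuracy equals the unweighted mean
# of the per-class accuracies quoted in the abstract.
reported_mean = (0.64 + 0.62 + 0.26) / 3
```

On the toy labels, four of six scores match. The same function applies unchanged to real paired scores as long as both lists use the three level names above.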

Related papers list