Fugu-MT Paper Translation (Abstract): Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program

Paper abstract: Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program

arxiv url: http://arxiv.org/abs/2606.05564v1
Date: Thu, 04 Jun 2026 01:20:39 GMT
Status: Translation complete
System update date: 2026-06-05 22:39:44.47111
Title: Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program
Authors: Varun Aggarwal, Kay Kobak, John Howarter
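Abstract summary: This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow uses OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and a structured rubric across six subcategories, each scored on a 0-3 scale. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours, averaging roughly 14 seconds per SoP.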
Reference score (independently computed attention score): 0.8602553195689513
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Undergraduate research programs such as the Summer Undergraduate Research Fellowship (SURF) at Purdue University receive thousands of applications every year, requiring significant time and effort for program staff to evaluate each submission consistently and within tight timelines. This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow uses OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and a structured rubric across six subcategories, each scored on a 0-3 scale. A few SoPs, graded by program staff, were used to tune the model responses. The model prompt was designed to generate numerical scores, rationales (including positive and negative aspects), and short excerpts from each submission. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours of compute time, averaging roughly 14 seconds per SoP (with per-SoP timing varying with SoP length, which ranged from 500 to 2,000 words). Notable differences in rubric adherence were observed across model versions, with GPT-5.2 adhering most closely. Disagreement in model scores was more pronounced for lower-scoring submissions. The LLM outputs replicated the role previously played by distributed human graders, providing the program coordinator with scored and rationale-annotated outputs for the entire applicant pool. The program coordinator then reviewed these outputs alongside each applicant's SoP, applying the same downstream office criteria used in prior SURF cycles, to produce a shortlist of strong candidates. This coordinator review was completed in approximately 4 hours, compared to the multi-week coordination effort required in prior program cycles.
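The rubric structure and timing figures in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the six subcategories, so `subcategory_1` through `subcategory_6` are hypothetical placeholders, and `RubricResult`, `validate`, and `batch_hours` are names invented here. Summing the six subscores into a total is also an assumption, since the abstract does not say how subscores are combined.

```python
from dataclasses import dataclass

# Hypothetical placeholders: the abstract does not name the six subcategories.
SUBCATEGORIES = tuple(f"subcategory_{i}" for i in range(1, 7))
MIN_SCORE, MAX_SCORE = 0, 3  # the 0-3 scale stated in the abstract


@dataclass(frozen=True)
class RubricResult:
    """One model output for one SoP, mirroring the fields the abstract lists."""
    scores: dict[str, int]       # one integer score per subcategory
    rationale: str               # positive and negative aspects
    excerpts: tuple[str, ...]    # short quotes taken from the SoP


def validate(result: RubricResult) -> int:
    """Check rubric adherence and return the summed score.

    The unweighted sum (range 0 to 18) is an assumption made for illustration.
    """
    if set(result.scores) != set(SUBCATEGORIES):
        raise ValueError("scores must cover exactly the six subcategories")
    for name, score in result.scores.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"{name}: score must be an integer")
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"{name}: score {score} is outside 0-3")
    return sum(result.scores.values())


def batch_hours(n_sops: int, seconds_per_sop: float) -> float:
    """Total sequential processing time in hours."""
    return n_sops * seconds_per_sop / 3600
```

With the abstract's figures, `batch_hours(1200, 14)` gives 16,800 seconds, or about 4.67 hours, which is consistent with the reported "approximately 4.6 hours" given that 14 seconds is itself described as a rough average. A check like `validate` is one way to detect the rubric-adherence differences the abstract mentions, for example a model returning a score outside the 0-3 range or omitting a subcategory.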


Related papers