Related papers: The competent Computational Thinking test (cCTt): a valid, reliable and gender-fair test for longitudinal CT studies in grades 3-6

The competent Computational Thinking test (cCTt): a valid, reliable and gender-fair test for longitudinal CT studies in grades 3-6

URL: http://arxiv.org/abs/2305.19526v2
Date: Mon, 5 Aug 2024 17:02:54 GMT
Title: The competent Computational Thinking test (cCTt): a valid, reliable and gender-fair test for longitudinal CT studies in grades 3-6
Authors: Laila El-Hamamsy, María Zapata-Cáceres, Estefanía Martín-Barroso, Francesco Mondada, Jessica Dehler Zufferey, Barbara Bruno, Marcos Román-González,
Abstract summary: This study investigated whether the competent Computational Thinking test (cCTt) could evaluate learning reliably from grades 3 to 6 (ages 7-11) using data from 2709 students. The findings indicate that the cCTt is valid, reliable and gender-fair for grades 3-6, although more complex items would be beneficial for grades 5-6.
Score: 0.06282171844772422
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The introduction of computing education into curricula worldwide requires multi-year assessments to evaluate the long-term impact on learning. However, no single Computational Thinking (CT) assessment spans primary school, and no group of CT assessments provides a means of transitioning between instruments. This study therefore investigated whether the competent CT test (cCTt) could evaluate learning reliably from grades 3 to 6 (ages 7-11) using data from 2709 students. The psychometric analysis employed Classical Test Theory, Item Response Theory, Measurement Invariance analyses which include Differential Item Functioning, normalised z-scoring, and PISA's methodology to establish proficiency levels. The findings indicate that the cCTt is valid, reliable and gender-fair for grades 3-6, although more complex items would be beneficial for grades 5-6. Grade-specific proficiency levels are provided to help tailor interventions, with a normalised scoring system to compare students across and between grades, and help establish transitions between instruments. To improve the utility of CT assessments among researchers, educators and practitioners, the findings emphasise the importance of i) developing and validating gender-fair, grade-specific, instruments aligned with students' cognitive maturation, and providing ii) proficiency levels, and iii) equivalency scales to transition between assessments. To conclude, the study provides insight into the design of longitudinal developmentally appropriate assessments and interventions.

Related papers

Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection [1.4624458429745086]
Current practices often rely on inconsistent and biased metrics.<n>Expert-level claims about AI performance are frequently made without rigorous validation.<n>This study proposes best practices tailored to the specific challenges of neonatal seizure detection.
arXiv Detail & Related papers (2025-08-06T21:55:28Z)
Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning [81.50681925980135]
We propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps.<n>It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making.<n>Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results.
arXiv Detail & Related papers (2025-05-23T12:42:50Z)
NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, WE models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity. A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design [15.2100541345819]
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. It consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications.
arXiv Detail & Related papers (2024-06-25T18:52:48Z)
Wearable Device-Based Real-Time Monitoring of Physiological Signals: Evaluating Cognitive Load Across Different Tasks [6.673424334358673]
This study employs cutting-edge wearable monitoring technology to conduct cognitive load assessment on electroencephalogram (EEG) data of secondary vocational students. The research delves into their application value in assessing cognitive load among secondary vocational students and their utility across various tasks.
arXiv Detail & Related papers (2024-06-11T10:48:26Z)
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
Survey of Computerized Adaptive Testing: A Machine Learning Perspective [66.26687542572974]
Computerized Adaptive Testing (CAT) provides an efficient and tailored method for assessing the proficiency of examinees. This paper aims to provide a machine learning-focused survey on CAT, presenting a fresh perspective on this adaptive testing method.
arXiv Detail & Related papers (2024-03-31T15:09:47Z)
Analyzing-Evaluating-Creating: Assessing Computational Thinking and Problem Solving in Visual Programming Domains [21.14335914575035]
Computational thinking (CT) and problem-solving skills are increasingly integrated into K-8 school curricula worldwide. We have developed ACE, a novel test focusing on the three higher cognitive levels in Bloom's taxonomy. We evaluate the psychometric properties of ACE through a study conducted with 371 students in grades 3-7 from 10 schools.
arXiv Detail & Related papers (2024-03-18T20:18:34Z)
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
AutoTrial: Prompting Language Models for Clinical Trial Design [53.630479619856516]
We present a method named AutoTrial to aid the design of clinical eligibility criteria using language models. Experiments on over 70K clinical trials verify that AutoTrial generates high-quality criteria texts.
arXiv Detail & Related papers (2023-05-19T01:04:16Z)
Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset. We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of the cirrhosis. This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z)
The competent Computational Thinking test (cCTt): Development and validation of an unplugged Computational Thinking test for upper primary school [0.8367620276482053]
The competent CT test (cCTt) is an unplugged CT test targeting 7-9 year-old students. The expert evaluation indicates that the cCTt shows good face, construct, and content validity. The psychometric analysis of the student data demonstrates adequate reliability, difficulty, and discriminability.
arXiv Detail & Related papers (2022-03-11T15:05:35Z)
Opportunities of a Machine Learning-based Decision Support System for Stroke Rehabilitation Assessment [64.52563354823711]
Rehabilitation assessment is critical to determine an adequate intervention for a patient. Current practices of assessment mainly rely on therapist's experience, and assessment is infrequently executed due to the limited availability of a therapist. We developed an intelligent decision support system that can identify salient features of assessment using reinforcement learning.
arXiv Detail & Related papers (2020-02-27T17:04:07Z)
Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education. Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding. We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.