Related papers: A Multimodal Dataset of Student Oral Presentations with Sensors and Evaluation Data

URL: http://arxiv.org/abs/2601.07576v1
Date: Mon, 12 Jan 2026 14:29:05 GMT
Title: A Multimodal Dataset of Student Oral Presentations with Sensors and Evaluation Data
Authors: Alvaro Becerra, Ruth Cobos, Roberto Daza,
Abstract summary: SOPHIAS is a 12-hour multimodal dataset containing recordings of 50 oral presentations delivered by 65 students at the Universidad Autonoma de Madrid. SOPHIAS integrates eight synchronized sensor streams from high-definition webcams, ambient and webcam audio, eye-tracking glasses, smartwatch physiological sensors, and clicker, keyboard, and mouse interactions. The dataset captures presentations conducted in real classroom settings, preserving authentic student behaviors, interactions, and physiological responses.
Score: 1.0705399532413615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Oral presentation skills are a critical component of higher education, yet comprehensive datasets capturing real-world student performance across multiple modalities remain scarce. To address this gap, we present SOPHIAS (Student Oral Presentation monitoring for Holistic Insights & Analytics using Sensors), a 12-hour multimodal dataset containing recordings of 50 oral presentations (10-15-minute presentation followed by 5-15-minute Q&A) delivered by 65 undergraduate and master's students at the Universidad Autonoma de Madrid. SOPHIAS integrates eight synchronized sensor streams from high-definition webcams, ambient and webcam audio, eye-tracking glasses, smartwatch physiological sensors, and clicker, keyboard, and mouse interactions. In addition, the dataset includes slides and rubric-based evaluations from teachers, peers, and self-assessments, along with timestamped contextual annotations. The dataset captures presentations conducted in real classroom settings, preserving authentic student behaviors, interactions, and physiological responses. SOPHIAS enables the exploration of relationships between multimodal behavioral and physiological signals and presentation performance, supports the study of peer assessment, and provides a benchmark for developing automated feedback and Multimodal Learning Analytics tools. The dataset is publicly available for research through GitHub and Science Data Bank.

Related papers

Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education [0.0]
This paper presents Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches, and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.
arXiv Detail & Related papers (2025-12-02T11:12:46Z)
Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we introduce InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z)
SensorLM: Learning the Language of Wearable Sensors [50.95988682423808]
We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people.
arXiv Detail & Related papers (2025-06-10T17:13:09Z)
MOSAIC-F: A Framework for Enhancing Students' Oral Presentation Skills through Personalized Feedback [1.0835264351334324]
This framework integrates Multimodal Learning Analytics (MMLA), Observations, Sensors, Artificial Intelligence (AI), and Collaborative assessments. By combining human-based and data-based evaluation techniques, this framework enables more accurate, personalized, and actionable feedback.
arXiv Detail & Related papers (2025-06-10T09:46:31Z)
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning [58.98865450345401]
We introduce LecEval, an automated metric grounded in Mayer's Cognitive Theory of Multimedia Learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings.
arXiv Detail & Related papers (2025-05-04T12:06:47Z)
DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild [1.766742562532995]
In this paper, a novel dataset is introduced, designed to assess student attention within in-person classroom settings. This dataset encompasses RGB camera data, featuring multiple cameras per student to capture both posture and facial expressions. A comprehensive suite of attention and emotion labels for each student is provided, generated through self-reporting and evaluations by four different experts. Our dataset uniquely combines facial and environmental camera data with smartwatch metrics, and includes ethnicities that are underrepresented in similar datasets, all within in-the-wild, in-person settings.
arXiv Detail & Related papers (2025-02-27T15:50:21Z)
SMARTe-VR: Student Monitoring and Adaptive Response Technology for e-Learning in Virtual Reality [8.605843219338793]
This work introduces SMARTe-VR, a platform for student monitoring in an immersive virtual reality environment designed for online education. The platform allows instructors to create customized learning sessions with video lectures, featuring an interface with an AutoQA system to evaluate understanding. We release a dataset that contains 5 research challenges with data from 10 users in VR-based TOEIC sessions.
arXiv Detail & Related papers (2025-01-19T07:53:39Z)
A multimodal dataset for understanding the impact of mobile phones on remote online virtual education [8.605843219338793]
The IMPROVE dataset includes behavioral, biometric, physiological, and academic performance data collected from 120 learners. A setup involving 16 synchronized sensors (including EEG, eye tracking, video cameras, smartwatches, and keystroke dynamics) was used to monitor learner activity. Technical validation confirmed signal quality, and statistical analyses revealed biometric changes associated with phone usage.
arXiv Detail & Related papers (2024-12-13T11:29:05Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content. Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects. We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance. This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings. Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, and meetings composed of 3-4 persons equipped with microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.