VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output
- URL: http://arxiv.org/abs/2502.04103v2
- Date: Thu, 13 Feb 2025 17:57:44 GMT
- Title: VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output
- Authors: Eason Chen, Chenyu Lin, Xinyi Tang, Aprille Xi, Canwen Wang, Jionghao Lin, Kenneth R. Koedinger
- Abstract summary: This paper introduces VTutor, an open-source Software Development Kit (SDK) that combines generative AI with advanced animation technologies.
VTutor enables researchers and developers to design emotionally resonant, contextually adaptive learning agents.
This toolkit enhances learner engagement, feedback receptivity, and human-AI interaction while promoting trustworthy AI principles in education.
- Score: 10.419430731115405
- Abstract: The rapid evolution of large language models (LLMs) has transformed human-computer interaction (HCI), yet interaction with LLMs remains largely text-based, while other multimodal approaches are under-explored. This paper introduces VTutor, an open-source Software Development Kit (SDK) that combines generative AI with advanced animation technologies to create engaging, adaptable, and realistic animated pedagogical agents (APAs) for human-AI multi-media interactions. VTutor leverages LLMs for real-time personalized feedback, advanced lip synchronization for natural speech alignment, and WebGL rendering for seamless web integration. Supporting various 2D and 3D character models, VTutor enables researchers and developers to design emotionally resonant, contextually adaptive learning agents. This toolkit enhances learner engagement, feedback receptivity, and human-AI interaction while promoting trustworthy AI principles in education. VTutor sets a new standard for next-generation APAs, offering an accessible, scalable solution for fostering meaningful and immersive human-AI interaction experiences. The VTutor project is open-sourced and welcomes community-driven contributions and showcases.
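The abstract outlines a pipeline of LLM-generated feedback, lip synchronization, and WebGL rendering. The sketch below is a minimal, hypothetical illustration of how one feedback turn could be wired together; none of the names here (generate_feedback, synthesize_speech, to_visemes, Viseme, agent_turn) come from the actual VTutor SDK, and the viseme timing is a crude stand-in for real phoneme alignment.

```python
# Hypothetical sketch (not the actual VTutor API) of an APA feedback turn:
# LLM text -> text-to-speech audio -> viseme timings for a rendered avatar.
import json
from dataclasses import dataclass, asdict

@dataclass
class Viseme:
    shape: str      # mouth-shape label, e.g. "AA", "M", "O"
    start_ms: int   # start of the shape within the audio track
    end_ms: int

def generate_feedback(learner_answer: str) -> str:
    """Placeholder for an LLM call returning personalized tutoring feedback."""
    return f"Nice work! Let's take another look at: {learner_answer}"

def synthesize_speech(text: str) -> bytes:
    """Placeholder for a text-to-speech call returning an audio buffer."""
    return text.encode("utf-8")  # stand-in for real audio bytes

def to_visemes(text: str) -> list:
    """Crude stand-in for phoneme-to-viseme alignment (80 ms per letter)."""
    visemes, t = [], 0
    for ch in text:
        if ch.isalpha():
            visemes.append(Viseme(shape=ch.upper(), start_ms=t, end_ms=t + 80))
            t += 80
    return visemes

def agent_turn(learner_answer: str) -> str:
    """One feedback turn: feedback text, speech audio, and a lip-sync track."""
    feedback = generate_feedback(learner_answer)
    audio = synthesize_speech(feedback)
    track = [asdict(v) for v in to_visemes(feedback)]
    # A real client would stream the audio and viseme track to the rendered avatar.
    return json.dumps({"text": feedback, "audio_bytes": len(audio), "visemes": track[:3]})

print(agent_turn("The derivative of x^2 is 2x."))
```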
Related papers
- Open-Sora: Democratizing Efficient Video Production for All [15.68402186082992]
We create Open-Sora, an open-source video generation model designed to produce high-fidelity video content.
Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation.
By embracing the open-source principle, Open-Sora gives full public access to the training, inference, and data-preparation code as well as the model weights.
arXiv Detail & Related papers (2024-12-29T08:52:49Z)
- Generative AI and Its Impact on Personalized Intelligent Tutoring Systems [0.0]
Generative AI enables personalized education through dynamic content generation, real-time feedback, and adaptive learning pathways.
The report explores key applications such as automated question generation, customized feedback mechanisms, and interactive dialogue systems.
Future directions highlight the potential advancements in multimodal AI integration, emotional intelligence in tutoring systems, and the ethical implications of AI-driven education.
arXiv Detail & Related papers (2024-10-14T16:01:01Z)
- From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents [78.15899922698631]
MAIC (Massive AI-empowered Course) is a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom.
We conduct preliminary experiments at Tsinghua University, one of China's leading universities.
arXiv Detail & Related papers (2024-09-05T13:22:51Z)
- V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM [0.0]
This paper presents V-Zen, an innovative Multimodal Large Language Model (MLLM) meticulously crafted to revolutionise the domain of GUI understanding and grounding.
V-Zen establishes new benchmarks in efficient grounding and next-action prediction, thereby laying the groundwork for self-operating computer systems.
The successful integration of V-Zen and GUIDE marks the dawn of a new era in multimodal AI research, opening the door to intelligent, autonomous computing experiences.
arXiv Detail & Related papers (2024-05-24T08:21:45Z)
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
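The iVideoGPT entry above describes folding visual observations, actions, and rewards into one token sequence. The following is a minimal sketch of that interleaving idea under assumed separator tokens and layout; it is not the paper's actual tokenization scheme, which additionally compresses the visual observations.

```python
# Illustrative token interleaving for a trajectory: per timestep, observation
# tokens, an action ID, and a discretized reward ID become one flat sequence.
OBS_SEP, ACT_SEP, REW_SEP = 1001, 1002, 1003  # hypothetical separator token IDs

def interleave_trajectory(obs_tokens, action_ids, reward_bins):
    """obs_tokens: per-step lists of visual token IDs (from some tokenizer);
    action_ids / reward_bins: one discrete ID per step."""
    sequence = []
    for obs, act, rew in zip(obs_tokens, action_ids, reward_bins):
        sequence += [OBS_SEP, *obs, ACT_SEP, act, REW_SEP, rew]
    return sequence

# Example: two timesteps with four visual tokens each.
seq = interleave_trajectory(
    obs_tokens=[[5, 17, 42, 7], [9, 3, 28, 14]],
    action_ids=[2, 0],
    reward_bins=[1, 1],
)
print(seq)
```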
- AI-Tutoring in Software Engineering Education [0.7631288333466648]
We conducted an exploratory case study by integrating the GPT-3.5-Turbo model as an AI-Tutor within the APAS Artemis.
The findings highlight advantages, such as timely feedback and scalability.
However, challenges such as generic responses and students' concerns that the AI-Tutor could inhibit their learning progress were also evident.
arXiv Detail & Related papers (2024-04-03T08:15:08Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing [99.80742991922992]
The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses.
LLaVA-Interactive goes beyond language prompts: visual prompts are also supported to align human intent during the interaction.
arXiv Detail & Related papers (2023-11-01T15:13:43Z)
- NExT-GPT: Any-to-Any Multimodal LLM [75.5656492989924]
We present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT.
We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio.
We introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality MosIT dataset, through which NExT-GPT acquires complex cross-modal semantic understanding and content generation.
arXiv Detail & Related papers (2023-09-11T15:02:25Z)
- DMCNet: Diversified Model Combination Network for Understanding Engagement from Video Screengrabs [0.4397520291340695]
Engagement plays a major role in developing intelligent educational interfaces.
The non-deep-learning models are based on combinations of popular algorithms such as Histogram of Oriented Gradients (HOG), Support Vector Machine (SVM), Scale-Invariant Feature Transform (SIFT), and Speeded Up Robust Features (SURF).
The deep learning methods include Densely Connected Convolutional Networks (DenseNet-121), Residual Network (ResNet-18) and MobileNetV1.
arXiv Detail & Related papers (2022-04-13T15:24:38Z)
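The DMCNet entry above names classical baselines built from hand-crafted features and an SVM. Below is a minimal sketch of one such baseline, HOG features fed to an SVM classifier, using placeholder data; it illustrates the general recipe, not the paper's implementation.

```python
# HOG + SVM engagement classifier sketch with fake frames and labels.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def hog_features(image, size=(128, 128)):
    """Resize a grayscale frame and extract its HOG descriptor."""
    frame = resize(image, size, anti_aliasing=True)
    return hog(frame, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# Placeholder data: replace with real screengrab frames and engagement labels.
rng = np.random.default_rng(0)
frames = rng.random((20, 160, 160))      # 20 fake grayscale frames
labels = rng.integers(0, 2, size=20)     # 0 = disengaged, 1 = engaged

X = np.stack([hog_features(f) for f in frames])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
print(clf.predict(X[:3]))
```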
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs are limited in visual relation learning because their local receptive fields struggle to model long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
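The last entry contrasts CNNs' local receptive fields with self-attention's ability to model long-range dependencies. The numpy sketch below shows scaled dot-product self-attention, in which every token attends to every other token in a single layer; shapes and weights are arbitrary placeholders.

```python
# Scaled dot-product self-attention: a global (non-local) mixing operation.
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_head)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len): every pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))                         # 6 tokens, 16-dim features
wq, wk, wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)           # (6, 8)
```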