TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial
Creation on Physical Tasks
- URL: http://arxiv.org/abs/2403.08049v1
- Date: Tue, 12 Mar 2024 19:46:59 GMT
- Title: TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial
Creation on Physical Tasks
- Authors: Yuexi Chen, Vlad I. Morariu, Anh Truong, Zhicheng Liu
- Abstract summary: TutoAI is a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks.
We distill common tutorial components by surveying existing work.
We present an approach to identify, assemble, and evaluate AI models for component extraction.
- Score: 18.999028085376594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixed-media tutorials, which integrate videos, images, text, and diagrams to
teach procedural skills, offer more browsable alternatives than timeline-based
videos. However, manually creating such tutorials is tedious, and existing
automated solutions are often restricted to a particular domain. While AI
models hold promise, it is unclear how to effectively harness their powers,
given the multi-modal data involved and the vast landscape of models. We
present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial
creation on physical tasks. First, we distill common tutorial components by
surveying existing work; then, we present an approach to identify, assemble,
and evaluate AI models for component extraction; finally, we propose guidelines
for designing user interfaces (UI) that support tutorial creation based on
AI-generated components. We show that TutoAI has achieved higher or similar
quality compared to a baseline model in preliminary user studies.
Related papers
- Draw2Learn: A Human-AI Collaborative Tool for Drawing-Based Science Learning [0.0]
Drawing supports learning by externalizing mental models, but providing timely feedback at scale remains challenging.
We present Draw2Learn, a system that explores how AI can act as a supportive teammate during drawing-based learning.
arXiv Detail & Related papers (2026-02-02T00:06:08Z) - A Versatile Multimodal Agent for Multimedia Content Generation [66.86040734610073]
We propose a MultiMedia-Agent designed to automate complex content creation tasks.
Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment.
arXiv Detail & Related papers (2026-01-06T18:49:47Z) - No More Manual Guides: Automatic and Scalable Generation of High-Quality Excel Tutorials [63.10037761131196]
Existing tutorials are manually authored by experts, require frequent updates after each software release, and incur substantial labor costs.
We present the first framework for automatically generating Excel tutorials directly from natural language task descriptions.
Our framework improves task execution success rates by 8.5% over state-of-the-art baselines.
arXiv Detail & Related papers (2025-09-26T03:21:39Z) - Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [94.23160400824969]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding.
This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures.
Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z) - DejAIvu: Identifying and Explaining AI Art on the Web in Real-Time with Saliency Maps [0.0]
We introduce DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability.
Our approach integrates efficient in-browser inference, gradient-based saliency analysis, and a seamless user experience, ensuring that AI detection is both transparent and interpretable.
arXiv Detail & Related papers (2025-02-12T22:24:49Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale.
We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials.
Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations, and grounding instructions in natural language to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - A Comprehensive Guide to Explainable AI: From Classical Models to LLMs [25.07463077055411]
Explainable Artificial Intelligence (XAI) addresses the growing need for transparency and interpretability in AI systems.
It explores interpretability in traditional models like Decision Trees, Linear Regression, and Support Vector Machines.
The book presents practical techniques such as SHAP, LIME, Grad-CAM, counterfactual explanations, and causal inference.
arXiv Detail & Related papers (2024-12-01T13:01:01Z) - Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces [1.3107174618549584]
Instruction Visual Grounding (IVG) is a multi-modal approach to object identification within a Graphical User Interface (GUI).
We propose IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and IVGdirect, which uses a multimodal architecture for end-to-end grounding.
Our final test dataset is publicly released to support future research.
arXiv Detail & Related papers (2024-05-05T19:10:19Z) - CoProNN: Concept-based Prototypical Nearest Neighbors for Explaining Vision Models [1.0855602842179624]
We present a novel approach that enables domain experts to quickly create concept-based explanations for computer vision tasks intuitively via natural language.
The modular design of CoProNN is simple to implement, it is straightforward to adapt to novel tasks and allows for replacing the classification and text-to-image models.
We show that our strategy competes very well with other concept-based XAI approaches on coarse grained image classification tasks and may even outperform those methods on more demanding fine grained tasks.
arXiv Detail & Related papers (2024-04-23T08:32:38Z) - An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z) - Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL).
This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z) - How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation [0.0]
The OpenAI Assistants API allows AI Tutor to easily embed, store, retrieve, and manage files and chat history.
The AI Tutor prototype demonstrates its ability to generate relevant, accurate answers with source citations.
arXiv Detail & Related papers (2023-11-29T15:02:46Z) - Vision Encoder-Decoder Models for AI Coaching [0.0]
The feasibility of this method is demonstrated using a Vision Transformer as the encoder and GPT-2 as the decoder.
Our integrated architecture directly processes input images, enabling natural question-and-answer dialogues with the AI coach.
arXiv Detail & Related papers (2023-11-09T09:06:21Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with
Millions of APIs [71.7495056818522]
We introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion.
We will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.
arXiv Detail & Related papers (2023-03-29T03:30:38Z) - Build-a-Bot: Teaching Conversational AI Using a Transformer-Based Intent
Recognition and Question Answering Architecture [15.19996462016215]
This paper proposes an interface for students to learn the principles of artificial intelligence by using a natural language pipeline to train a customized model to answer questions based on their own school curriculums.
The pipeline teaches students data collection, data augmentation, intent recognition, and question answering by having them work through each of these processes while creating their AI agent.
arXiv Detail & Related papers (2022-12-14T22:57:44Z) - Instance As Identity: A Generic Online Paradigm for Video Instance
Segmentation [84.3695480773597]
We propose a new online VIS paradigm named Instance As Identity (IAI).
IAI models temporal information for both detection and tracking in an efficient way.
We conduct extensive experiments on three VIS benchmarks.
arXiv Detail & Related papers (2022-08-05T10:29:30Z) - MONAI Label: A framework for AI-assisted Interactive Labeling of 3D
Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models.
We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.