TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial
Creation on Physical Tasks
- URL: http://arxiv.org/abs/2403.08049v1
- Date: Tue, 12 Mar 2024 19:46:59 GMT
- Title: TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial
Creation on Physical Tasks
- Authors: Yuexi Chen, Vlad I. Morariu, Anh Truong, Zhicheng Liu
- Abstract summary: TutoAI is a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks.
We distill common tutorial components by surveying existing work.
We present an approach to identify, assemble, and evaluate AI models for component extraction.
- Score: 18.999028085376594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixed-media tutorials, which integrate videos, images, text, and diagrams to
teach procedural skills, offer more browsable alternatives than timeline-based
videos. However, manually creating such tutorials is tedious, and existing
automated solutions are often restricted to a particular domain. While AI
models hold promise, it is unclear how to effectively harness their powers,
given the multi-modal data involved and the vast landscape of models. We
present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial
creation on physical tasks. First, we distill common tutorial components by
surveying existing work; then, we present an approach to identify, assemble,
and evaluate AI models for component extraction; finally, we propose guidelines
for designing user interfaces (UI) that support tutorial creation based on
AI-generated components. We show that TutoAI has achieved higher or similar
quality compared to a baseline model in preliminary user studies.
Related papers
- Draw2Learn: A Human-AI Collaborative Tool for Drawing-Based Science Learning [0.0]
Drawing supports learning by externalizing mental models, but providing timely feedback at scale remains challenging.
We present Draw2Learn, a system that explores how AI can act as a supportive teammate during drawing-based learning.
arXiv Detail & Related papers (2026-02-02T00:06:08Z) - A Versatile Multimodal Agent for Multimedia Content Generation [66.86040734610073]
We propose a MultiMedia-Agent designed to automate complex content creation tasks.
Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment.
arXiv Detail & Related papers (2026-01-06T18:49:47Z) - No More Manual Guides: Automatic and Scalable Generation of High-Quality Excel Tutorials [63.10037761131196]
Existing tutorials are manually authored by experts, require frequent updates after each software release, and incur substantial labor costs.
We present the first framework for automatically generating Excel tutorials directly from natural language task descriptions.
Our framework improves task execution success rates by 8.5% over state-of-the-art baselines.
arXiv Detail & Related papers (2025-09-26T03:21:39Z) - Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [94.23160400824969]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding.
This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures.
Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z) - DejAIvu: Identifying and Explaining AI Art on the Web in Real-Time with Saliency Maps [0.0]
We introduce DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability.
Our approach integrates efficient in-browser inference, gradient-based saliency analysis, and a seamless user experience, ensuring that AI detection is both transparent and interpretable.
arXiv Detail & Related papers (2025-02-12T22:24:49Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale.
We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials.
Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations, and grounding instructions in natural language to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - A Comprehensive Guide to Explainable AI: From Classical Models to LLMs [25.07463077055411]
Explainable Artificial Intelligence (XAI) addresses the growing need for transparency and interpretability in AI systems.
It explores interpretability in traditional models like Decision Trees, Linear Regression, and Support Vector Machines.
The book presents practical techniques such as SHAP, LIME, Grad-CAM, counterfactual explanations, and causal inference.
arXiv Detail & Related papers (2024-12-01T13:01:01Z) - Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces [1.3107174618549584]
Instruction Visual Grounding (IVG) is a multi-modal approach to object identification within a Graphical User Interface (GUI).
We propose IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and IVGdirect, which uses a multimodal architecture for end-to-end grounding.
Our final test dataset is publicly released to support future research.
arXiv Detail & Related papers (2024-05-05T19:10:19Z) - CoProNN: Concept-based Prototypical Nearest Neighbors for Explaining Vision Models [1.0855602842179624]
We present a novel approach that enables domain experts to quickly create concept-based explanations for computer vision tasks intuitively via natural language.
The modular design of CoProNN is simple to implement, it is straightforward to adapt to novel tasks and allows for replacing the classification and text-to-image models.
We show that our strategy competes very well with other concept-based XAI approaches on coarse grained image classification tasks and may even outperform those methods on more demanding fine grained tasks.
arXiv Detail & Related papers (2024-04-23T08:32:38Z) - An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z) - Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL).
This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z) - How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation [0.0]
The OpenAI Assistants API allows AI Tutor to easily embed, store, retrieve, and manage files and chat history.
The AI Tutor prototype demonstrates its ability to generate relevant, accurate answers with source citations.
arXiv Detail & Related papers (2023-11-29T15:02:46Z) - Vision Encoder-Decoder Models for AI Coaching [0.0]
The feasibility of this method is demonstrated using a Vision Transformer as the encoder and GPT-2 as the decoder.
Our integrated architecture directly processes input images, enabling natural question-and-answer dialogues with the AI coach.
arXiv Detail & Related papers (2023-11-09T09:06:21Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with
Millions of APIs [71.7495056818522]
We introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion.
We will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.
arXiv Detail & Related papers (2023-03-29T03:30:38Z) - Build-a-Bot: Teaching Conversational AI Using a Transformer-Based Intent
Recognition and Question Answering Architecture [15.19996462016215]
This paper proposes an interface for students to learn the principles of artificial intelligence by using a natural language pipeline to train a customized model to answer questions based on their own school curriculums.
The pipeline teaches students data collection, data augmentation, intent recognition, and question answering by having them work through each of these processes while creating their AI agent.
arXiv Detail & Related papers (2022-12-14T22:57:44Z) - Instance As Identity: A Generic Online Paradigm for Video Instance
Segmentation [84.3695480773597]
We propose a new online VIS paradigm named Instance As Identity (IAI).
IAI models temporal information for both detection and tracking in an efficient way.
We conduct extensive experiments on three VIS benchmarks.
arXiv Detail & Related papers (2022-08-05T10:29:30Z) - MONAI Label: A framework for AI-assisted Interactive Labeling of 3D
Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models.
We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.