AI-Instruments: Embodying Prompts as Instruments to Abstract & Reflect Graphical Interface Commands as General-Purpose Tools
- URL: http://arxiv.org/abs/2502.18736v1
- Date: Wed, 26 Feb 2025 01:11:24 GMT
- Title: AI-Instruments: Embodying Prompts as Instruments to Abstract & Reflect Graphical Interface Commands as General-Purpose Tools
- Authors: Nathalie Riche, Anna Offenwanger, Frederic Gmeiner, David Brown, Hugo Romat, Michel Pahud, Nicolai Marquardt, Kori Inkpen, Ken Hinckley
- Abstract summary: Chat-based prompts respond with linear-sequential texts, making it difficult to explore and refine ambiguous intents. We show how AI-Instruments embody "prompts" as interface objects via three key principles.
- Score: 22.004677014808458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chat-based prompts respond with verbose linear-sequential texts, making it difficult to explore and refine ambiguous intents, back up and reinterpret, or shift directions in creative AI-assisted design work. AI-Instruments instead embody "prompts" as interface objects via three key principles: (1) Reification of user-intent as reusable direct-manipulation instruments; (2) Reflection of multiple interpretations of ambiguous user-intents (Reflection-in-intent) as well as the range of AI-model responses (Reflection-in-response) to inform design "moves" towards a desired result; and (3) Grounding to instantiate an instrument from an example, result, or extrapolation directly from another instrument. Further, AI-Instruments leverage LLMs to suggest, vary, and refine new instruments, enabling a system that goes beyond hard-coded functionality by generating its own instrumental controls from content. We demonstrate four technology probes, applied to image generation, and qualitative insights from twelve participants, showing how AI-Instruments address challenges of intent formulation, steering via direct manipulation, and non-linear iterative workflows to reflect and resolve ambiguous intents.
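To make the three principles concrete, the sketch below models a hypothetical "instrument" object in Python: intent is reified as a reusable, directly manipulable object; reflection is represented by keeping multiple candidate interpretations and model responses alongside it; and grounding is modeled as a constructor that derives an instrument from an example or prior result. The class names, fields, and the stubbed `suggest_variations` helper are illustrative assumptions based on the abstract, not the AI-Instruments implementation described in the paper.

```python
# Illustrative sketch only: names and structure are assumptions,
# not the AI-Instruments system described in the paper.
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """One reading of an ambiguous intent, plus the model response it produced."""
    reading: str            # Reflection-in-intent: one way to interpret the intent
    response_preview: str   # Reflection-in-response: what the model returned for it


@dataclass
class Instrument:
    """Reification: a user intent embodied as a reusable, manipulable object."""
    intent: str                                          # the prompt fragment this instrument controls
    strength: float = 1.0                                # direct-manipulation handle (e.g. a slider)
    interpretations: list[Interpretation] = field(default_factory=list)

    @classmethod
    def grounded_in(cls, example: str) -> "Instrument":
        """Grounding: instantiate an instrument from an example or prior result."""
        return cls(intent=f"match the style of: {example}")

    def apply(self, base_prompt: str) -> str:
        """Compose this instrument into a prompt sent to the generative model."""
        return f"{base_prompt}; {self.intent} (weight={self.strength:.1f})"


def suggest_variations(instrument: Instrument, n: int = 3) -> list[Instrument]:
    """Placeholder for the LLM call that proposes refined or alternative instruments."""
    # A real system would prompt an LLM here; this stub fabricates deterministic variants.
    return [Instrument(intent=f"{instrument.intent} (variant {i + 1})") for i in range(n)]


if __name__ == "__main__":
    mood = Instrument.grounded_in("a foggy harbor at dawn")
    mood.strength = 0.7
    mood.interpretations.append(
        Interpretation(reading="foggy -> muted, low-contrast palette",
                       response_preview="draft image A"))
    print(mood.apply("portrait of a lighthouse keeper"))
    for variant in suggest_variations(mood):
        print("alternative:", variant.intent)
```

In this framing, dragging the instrument's slider or swapping in a different interpretation re-derives the prompt without retyping it, which is the reusable, non-linear workflow the abstract contrasts with chat-based prompting.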
Related papers
- VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation [24.25129798349837]
Language-driven object navigation requires agents to interpret natural language descriptions of target objects. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy.
arXiv Detail & Related papers (2026-02-07T14:01:29Z) - Bridging Gulfs in UI Generation through Semantic Guidance [16.245249868262178]
We develop a system that enables users to specify semantics, visualize relationships, and extract how semantics are reflected in generated UIs. A comparative user study suggests that our approach enhances users' perceived control over intent expression and outcome interpretation, and facilitates more predictable, iterative refinement.
arXiv Detail & Related papers (2026-01-27T04:01:53Z) - In-Video Instructions: Visual Signals as Generative Control [79.44662698914401]
We investigate whether video generation models can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions. In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions.
arXiv Detail & Related papers (2025-11-24T18:38:45Z) - Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation [60.63465682731118]
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
arXiv Detail & Related papers (2025-11-12T04:28:14Z) - Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning. We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts. GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z) - MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation [61.19099947706954]
This paper introduces MIRA, a pioneering framework for task instruction recommendation. With MIRA, users can long-press on images or text objects to receive contextually relevant instruction recommendations for executing AI tasks. MIRA has demonstrated substantial improvements in the accuracy of instruction recommendation.
arXiv Detail & Related papers (2025-09-17T07:43:14Z) - ThematicPlane: Bridging Tacit User Intent and Latent Spaces for Image Generation [49.805992099208595]
We introduce ThematicPlane, a system that enables users to navigate and manipulate high-level semantic concepts. This interface bridges the gap between tacit creative intent and system control.
arXiv Detail & Related papers (2025-08-08T06:57:14Z) - Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation [52.6091162517921]
INSIGHT is a two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
arXiv Detail & Related papers (2025-08-03T12:52:27Z) - SimStep: Chain-of-Abstractions for Incremental Specification and Debugging of AI-Generated Interactive Simulations [16.00479720281197]
Chain-of-Abstractions (CoA) is a way to recover programming's core affordances. CoA decomposes the synthesis process into a sequence of cognitively meaningful, task-aligned representations. SimStep is an authoring environment for teachers that scaffolds simulation creation through four intermediate abstractions.
arXiv Detail & Related papers (2025-07-13T14:54:17Z) - DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation [1.4154022683679812]
Vision-and-Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues. We propose a Dual Object Perception-Enhancement Network (DOPE) to address these issues and improve navigation performance.
arXiv Detail & Related papers (2025-04-30T06:47:13Z) - Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.
Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.
We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating the performance of proprietary and open-weight models.
arXiv Detail & Related papers (2025-02-18T17:12:26Z) - SHAPE-IT: Exploring Text-to-Shape-Display for Generative Shape-Changing Behaviors with LLMs [12.235304780960142]
This paper introduces text-to-shape-display, a novel approach to generating dynamic shape changes in pin-based shape displays through natural language commands.
By leveraging large language models (LLMs) and AI-chaining, our approach allows users to author shape-changing behaviors on demand through text prompts without programming.
arXiv Detail & Related papers (2024-09-10T04:18:49Z) - NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z) - Answer is All You Need: Instruction-following Text Embedding via Answering the Question [41.727700155498546]
This paper offers a new viewpoint, which treats the instruction as a question about the input text and encodes the expected answers to obtain the representation accordingly.
Specifically, we propose InBedder that instantiates this embed-via-answering idea by only fine-tuning language models on abstractive question answering tasks.
arXiv Detail & Related papers (2024-02-15T01:02:41Z) - Interactive AI Alignment: Specification, Process, and Evaluation Alignment [30.599781014726823]
Modern AI enables a high-level, declarative form of interaction.
Users describe the intended outcome they wish an AI to produce, but do not actually create the outcome themselves.
arXiv Detail & Related papers (2023-10-23T14:33:11Z) - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z) - I3: Intent-Introspective Retrieval Conditioned on Instructions [83.91776238599824]
I3 is a unified retrieval system that performs Intent-Introspective retrieval across various tasks conditioned on Instructions without task-specific training.
I3 incorporates a pluggable introspector in a parameter-isolated manner to comprehend specific retrieval intents.
It utilizes extensive LLM-generated data to train I3 phase-by-phase, embodying two key designs: progressive structure pruning and drawback-based data refinement.
arXiv Detail & Related papers (2023-08-19T14:17:57Z) - $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z) - CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots [9.393951367344894]
This work explores the capacity of large language models to address problems at the intersection of spatial planning and natural language interfaces for navigation.
We focus on following complex instructions that are more akin to natural conversation than traditional explicit procedural directives typically seen in robotics.
We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it by adding complex language queries for 40 object types.
arXiv Detail & Related papers (2023-07-21T19:09:37Z) - ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts [92.92047324641622]
We propose modAlity-aligneD Action PrompTs (ADAPT) for Vision-Language Navigation (VLN).
ADAPT provides the VLN agent with action prompts to enable the explicit learning of action-level modality alignment.
Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.
arXiv Detail & Related papers (2022-05-31T02:41:31Z) - Object-and-Action Aware Model for Visual Language Navigation [70.33142095637515]
Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions.
We propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural language based instruction separately.
This enables each process to match object-centered/action-centered instruction to their own counterpart visual perception/action orientation flexibly.
arXiv Detail & Related papers (2020-07-29T06:32:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.