SHAPE-IT: Exploring Text-to-Shape-Display for Generative Shape-Changing Behaviors with LLMs
- URL: http://arxiv.org/abs/2409.06205v1
- Date: Tue, 10 Sep 2024 04:18:49 GMT
- Title: SHAPE-IT: Exploring Text-to-Shape-Display for Generative Shape-Changing Behaviors with LLMs
- Authors: Wanli Qian, Chenfeng Gao, Anup Sathya, Ryo Suzuki, Ken Nakagaki
- Abstract summary: This paper introduces text-to-shape-display, a novel approach to generating dynamic shape changes in pin-based shape displays through natural language commands.
By leveraging large language models (LLMs) and AI-chaining, our approach allows users to author shape-changing behaviors on demand through text prompts without programming.
- Score: 12.235304780960142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces text-to-shape-display, a novel approach to generating dynamic shape changes in pin-based shape displays through natural language commands. By leveraging large language models (LLMs) and AI-chaining, our approach allows users to author shape-changing behaviors on demand through text prompts without programming. We describe the foundational aspects necessary for such a system, including the identification of key generative elements (primitive, animation, and interaction) and design requirements to enhance user interaction, based on formative exploration and iterative design processes. Based on these insights, we develop SHAPE-IT, an LLM-based authoring tool for a 24 x 24 shape display, which translates the user's textual command into executable code and allows for quick exploration through a web-based control interface. We evaluate the effectiveness of SHAPE-IT in two ways: 1) performance evaluation and 2) user evaluation (N = 10). The study conclusions highlight the ability to facilitate rapid ideation of a wide range of shape-changing behaviors with AI. However, the findings also expose accuracy-related challenges and limitations, prompting further exploration into refining the framework for leveraging AI to better suit the unique requirements of shape-changing systems.
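As a rough illustration of the text-to-shape-display pipeline the abstract describes, the sketch below shows one way LLM-generated behavior code could be compiled and sampled into pin heights for a 24 x 24 display. Everything here is an assumption for illustration: `request_behavior_code` is a hypothetical stand-in for the real LLM call (its canned reply fakes a model answer), and the `height(x, y, t)` contract is not SHAPE-IT's actual generated-code format.

```python
# Minimal sketch, not SHAPE-IT's implementation: an LLM turns a natural-language
# prompt into executable per-pin animation code for a 24 x 24 pin display.

GRID = 24          # pins per side, matching the 24 x 24 display in the paper
MAX_HEIGHT = 1.0   # normalized pin extension

def request_behavior_code(prompt: str) -> str:
    """Hypothetical LLM call: return Python source defining height(x, y, t).
    Here we fake the model's answer for the prompt 'a ripple from the center'."""
    return (
        "def height(x, y, t):\n"
        "    import math\n"
        "    r = math.hypot(x - 11.5, y - 11.5)\n"
        "    return 0.5 + 0.5 * math.sin(r - 4.0 * t)\n"
    )

def compile_behavior(source: str):
    """Exec the generated code in a scratch namespace and pull out height()."""
    namespace: dict = {}
    exec(source, namespace)  # a real system would sandbox/validate this code
    return namespace["height"]

def render_frame(height, t: float) -> list[list[float]]:
    """Sample the behavior into a GRID x GRID array of normalized pin heights."""
    return [[max(0.0, min(MAX_HEIGHT, height(x, y, t)))
             for x in range(GRID)] for y in range(GRID)]

if __name__ == "__main__":
    behavior = compile_behavior(request_behavior_code("a ripple from the center"))
    frame = render_frame(behavior, t=0.0)
    print(f"{len(frame)}x{len(frame[0])} frame, center pin = {frame[12][12]:.2f}")
```

In the actual system, per the abstract, the generated code would drive the physical display through a web-based control interface rather than being printed.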
Related papers
- ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification [34.82553240281019]
Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images.
Existing methods rely solely on identity label supervision.
Vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling.
arXiv Detail & Related papers (2025-04-25T02:37:47Z)
- Textual-to-Visual Iterative Self-Verification for Slide Generation [46.99825956909532]
We decompose the task of generating missing presentation slides into two key components: content generation and layout generation.
Our approach significantly outperforms baseline methods in terms of alignment, logical flow, visual appeal, and readability.
arXiv Detail & Related papers (2025-02-21T12:21:09Z)
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL).
Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.
These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
- From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing [2.7568948557193287]
Face Recognition (FR) has advanced significantly with the development of deep learning, achieving high accuracy in several applications.
The lack of interpretability of these systems raises concerns about their accountability, fairness, and reliability.
We propose an interactive framework to enhance the explainability of FR models by combining model-agnostic Explainable Artificial Intelligence (XAI) and Natural Language Processing (NLP) techniques.
arXiv Detail & Related papers (2024-09-24T13:40:39Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural language descriptions of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Successor Features for Efficient Multisubject Controlled Text Generation [48.37713738712319]
We introduce SF-GEN, which is grounded in two primary concepts: successor features (SFs) and language model rectification.
SF-GEN seamlessly integrates the two to enable dynamic steering of text generation with no need to alter the LLM's parameters.
To the best of our knowledge, our research represents the first application of successor features in text generation.
arXiv Detail & Related papers (2023-11-03T00:17:08Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Exploring Effective Factors for Improving Visual In-Context Learning [56.14208975380607]
In-Context Learning (ICL) refers to understanding a new task from a few demonstrations (a.k.a. a prompt) and predicting on new inputs without tuning the model.
This paper shows that prompt selection and prompt fusion are two major factors with a direct impact on the inference performance of visual in-context learning.
We propose prompt-SelF, a simple framework for visual in-context learning.
arXiv Detail & Related papers (2023-04-10T17:59:04Z)
- LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including virtual classes during training, i.e., class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.