Related papers: ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models

URL: http://arxiv.org/abs/2306.09649v3
Date: Thu, 2 May 2024 08:28:19 GMT
Title: ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models
Authors: Jackie Junrui Yang, Yingtian Shi, Yuhan Zhang, Karina Li, Daniel Wan Rosli, Anisha Jain, Shuning Zhang, Tianshi Li, James A. Landay, Monica S. Lam
Abstract summary: Multimodal interfaces can surpass the efficiency of either modality alone. This paper presents ReactGenie, a programming framework that better separates multimodal input from the computational model. Our evaluation showed that 12 developers can learn and build a nontrivial ReactGenie application in under 2.5 hours on average.
Score: 12.0218963520643
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: By combining voice and touch interactions, multimodal interfaces can surpass the efficiency of either modality alone. Traditional multimodal frameworks require laborious developer work to support rich multimodal commands where the user's multimodal command involves possibly exponential combinations of actions/function invocations. This paper presents ReactGenie, a programming framework that better separates multimodal input from the computational model to enable developers to create efficient and capable multimodal interfaces with ease. ReactGenie translates multimodal user commands into NLPL (Natural Language Programming Language), a programming language we created, using a neural semantic parser based on large-language models. The ReactGenie runtime interprets the parsed NLPL and composes primitives in the computational model to implement complex user commands. As a result, ReactGenie allows easy implementation and unprecedented richness in commands for end-users of multimodal apps. Our evaluation showed that 12 developers can learn and build a nontrivial ReactGenie application in under 2.5 hours on average. In addition, compared with a traditional GUI, end-users can complete tasks faster and with less task load using ReactGenie apps.

Related papers

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities. We build a multimodal text-centric dataset for multimodal alignment pre-training. We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
Executable Code Actions Elicit Better LLM Agents [76.95566120678787]
This work proposes to use Python code to consolidate Large Language Model (LLM) agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language.
arXiv Detail & Related papers (2024-02-01T21:38:58Z)
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks [81.9962823875981]
We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition. The framework comprises two primary modules: the Swift module, representing fast and intuitive thinking, and the Sage module, emulating deliberate thought processes. In 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms other methods such as SayCan, ReAct, and Reflexion.
arXiv Detail & Related papers (2023-05-27T07:04:15Z)
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [24.517389691825667]
ChatBridge is a novel multimodal language model that leverages the expressive capabilities of language to bridge the gap between various modalities. All codes, data, and models of ChatBridge will be open-sourced.
arXiv Detail & Related papers (2023-05-25T14:34:08Z)
i-Code Studio: A Configurable and Composable Framework for Integrative AI [93.74891865028867]
We propose the i-Code Studio, a flexible and composable framework for Integrative AI. The i-Code Studio orchestrates multiple pre-trained models in a fine-tuning-free fashion to conduct complex multimodal tasks. The i-Code Studio achieves impressive results on a variety of zero-shot multimodal tasks, such as video-to-text retrieval, speech-to-speech translation, and visual question answering.
arXiv Detail & Related papers (2023-05-23T06:45:55Z)
Prompting Is Programming: A Query Language for Large Language Models [5.8010446129208155]
We present the novel idea of Language Model Programming (LMP). LMP generalizes language model prompting from pure text prompts to an intuitive combination of text prompting and scripting. We show that LMQL can capture a wide range of state-of-the-art prompting methods in an intuitive way.
arXiv Detail & Related papers (2022-12-12T18:09:09Z)
"Think Before You Speak": Improving Multi-Action Dialog Policy by Planning Single-Action Dialogs [33.78889030078026]
Multi-action dialog policy (MADP) generates multiple atomic dialog actions per turn. We propose Planning Enhanced Dialog Policy (PEDP), a novel multi-task learning framework that learns single-action dialog dynamics. Our fully supervised learning-based method achieves a solid task success rate of 90.6%, improving 3% compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-04-25T07:55:53Z)
Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition [64.06167416127386]
We propose Multi-Agent Dialog Policy Learning, which regards both the system and the user as the dialog agents. Two agents interact with each other and are jointly learned simultaneously. Results show that our method can successfully build a system policy and a user policy simultaneously.
arXiv Detail & Related papers (2020-04-08T04:51:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.