Lightweight Neural App Control
- URL: http://arxiv.org/abs/2410.17883v2
- Date: Wed, 12 Feb 2025 17:51:51 GMT
- Title: Lightweight Neural App Control
- Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
- Abstract summary: This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC).
LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions.
We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach.
- Score: 42.820784178464656
- License:
- Abstract: This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
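The abstract does not spell out the exact division of labour between the small Action Transformer (AcT) and the fine-tuned VLM, so the sketch below is only an illustrative reading of it: AcT predicts the action type and the target UI element, and the VLM is called only when an action needs free-form text. All class and method names (LiMACAgent, predict_action_type, etc.) are assumptions for exposition, not the authors' actual code.

```python
# Illustrative sketch of the LiMAC decision flow described in the abstract.
# Names and the AcT/VLM split are assumptions, not the authors' actual API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    screenshot: bytes        # raw screen capture
    ui_tree: List[dict]      # UI elements with bounds, text, and type


@dataclass
class Action:
    kind: str                            # e.g. "click", "scroll", "input-text"
    target_index: Optional[int] = None   # index of the UI element acted on
    text: Optional[str] = None           # free-form text, if the action needs it


class LiMACAgent:
    """Hypothetical wrapper combining a small Action Transformer (AcT)
    with a fine-tuned VLM, following the abstract's description."""

    def __init__(self, act_model, vlm):
        self.act = act_model   # lightweight transformer: action type + UI target
        self.vlm = vlm         # fine-tuned VLM: generates text when required

    def step(self, goal: str, history: List[Observation]) -> Action:
        obs = history[-1]
        # 1) AcT predicts the action type from the goal and observation history.
        kind = self.act.predict_action_type(goal, history)

        # 2) Actions that need free-form text are delegated to the VLM
        #    (assumed split; the abstract only says the two are integrated).
        if kind in {"input-text", "open-app"}:
            text = self.vlm.generate_text(goal, obs.screenshot, obs.ui_tree, kind)
            return Action(kind=kind, text=text)

        # 3) Otherwise AcT also selects which UI element to act on.
        target = self.act.predict_target(goal, history, obs.ui_tree)
        return Action(kind=kind, target_index=target)
```

Under this reading, the heavier VLM call is kept off the critical path for most gestures, which is one plausible way the abstract's efficiency claim could be realised on-device.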
Related papers
- AppVLM: A Lightweight Vision Language Model for Online App Control [39.91330570886891]
We introduce AppVLM, a lightweight Vision-Language Model (VLM).
First, we fine-tune it offline on the AndroidControl dataset.
Then, we refine its policy by collecting data from the AndroidWorld environment.
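A minimal sketch of the two-stage recipe this summary describes: offline fine-tuning on AndroidControl, then online policy refinement with data collected from AndroidWorld. All function and dataset names are placeholders, and the rollout filter is an assumption the summary does not state.

```python
# Placeholder sketch of the offline-then-online recipe in the AppVLM summary.
def train_appvlm(vlm, android_control_dataset, android_world_env, rounds=3):
    # Stage 1: supervised fine-tuning on the offline demonstrations.
    vlm.fine_tune(android_control_dataset)

    # Stage 2: iterate between collecting rollouts online and refining the policy.
    for _ in range(rounds):
        collected = []
        for task in android_world_env.tasks():
            episode = android_world_env.run_episode(policy=vlm, task=task)
            if episode.succeeded:      # assumed filter: keep successful episodes
                collected.append(episode)
        vlm.fine_tune(collected)       # refine on the self-collected data
    return vlm
```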
arXiv Detail & Related papers (2025-02-10T12:32:21Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieve 90% of the performance of the full-size model with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Automated Text Scoring in the Age of Generative AI for the GPU-poor [49.1574468325115]
We analyze the performance and efficiency of open-source, small-scale generative language models for automated text scoring.
Results show that GLMs can be fine-tuned to achieve adequate, though not state-of-the-art, performance.
arXiv Detail & Related papers (2024-07-02T01:17:01Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
- MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
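A generic sketch of the kind of on-device latency measurement loop this summary describes, comparing model sizes and quantization levels. This is not MobileAIBench's actual API: model loading and names are illustrative, and memory or energy measurement would require platform-specific tooling not shown here.

```python
import time


def measure_latency(model, prompts):
    # Time one inference call per prompt and report the mean latency in seconds.
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        model.generate(prompt)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)


# Hypothetical comparison of a full-precision and a 4-bit quantized variant:
# fp16_latency = measure_latency(load_model("llm-fp16"), prompts)
# int4_latency = measure_latency(load_model("llm-int4"), prompts)
# print(f"fp16: {fp16_latency:.3f}s/query, int4: {int4_latency:.3f}s/query")
```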
arXiv Detail & Related papers (2024-06-12T22:58:12Z)
- Confidant: Customizing Transformer-based LLMs via Collaborative Edge Training [18.526329975259483]
Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks.
It is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets.
We propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices.
arXiv Detail & Related papers (2023-11-22T13:20:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.