Related papers: FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning

FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning

URL: http://arxiv.org/abs/2512.19107v1
Date: Mon, 22 Dec 2025 07:21:07 GMT
Title: FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning
Authors: Zhe Yang, Xiaoshuang Sheng, Zhengnan Zhang, Jidong Wu, Zexing Wang, Xin He, Shenghua Xu, Guanjing Xiong,
Abstract summary: We propose the FC-MIR framework: leveraging sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency.<n>We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions.<n>We deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.
Score: 7.78727102442322
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Identifying user intent from mobile UI operation trajectories is critical for advancing UI understanding and enabling task automation agents. While Multimodal Large Language Models (MLLMs) excel at video understanding tasks, their real-time mobile deployment is constrained by heavy computational costs and inefficient redundant frame processing. To address these issues, we propose the FC-MIR framework: leveraging keyframe sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency, while integrating state-of-the-art closed-source MLLMs or fine-tuned models (e.g., Qwen3-VL) for trajectory summarization and intent prediction. We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions. For rigorous assessment, we construct a UI trajectory dataset covering scenarios from UI-Agents (Agent-I) and real user interactions (Person-I). Experimental results show our compression method retains performance at 50%-60% compression rates; both closed-source and fine-tuned MLLMs demonstrate strong intent summarization, supporting potential lightweight on-device deployment. However, MLLMs still struggle with useful and "surprising" suggestions, leaving room for improvement. Finally, we deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.

Related papers

Constitutional Black-Box Monitoring for Scheming in LLM Agents [1.4619913143519836]
We use language models to examine agent behaviors for suspicious actions.<n>We study constitutional black-box monitors that detect scheming using only externally observable inputs and outputs.<n>We find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization.
arXiv Detail & Related papers (2026-02-28T22:31:32Z)
CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension [49.6969505536365]
We propose CREM, with a unified framework that enhances multimodal representations for retrieval while preserving generative ability.<n>CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks.
arXiv Detail & Related papers (2026-02-22T08:09:51Z)
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object (RVOS) aims to segment objects in videos based on textual queries.<n>Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning.<n>We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL.<n>We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
UI-UG: A Unified MLLM for UI Understanding and Generation [19.7078650905834]
We introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities.<n>For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding.<n>For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs.
arXiv Detail & Related papers (2025-09-29T06:59:09Z)
Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition [8.584946920657517]
This paper introduces a novel approach to understanding user intents from user interaction trajectories.<n>We perform structured interaction summarization, capturing key information from each user action.<n>Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries.
arXiv Detail & Related papers (2025-09-15T20:20:30Z)
Framing AI System Benchmarking as a Learning Task: FlexBench and the Open MLPerf Dataset [0.0]
Existing AI system benchmarks such as LLMerf often struggle to keep pace with the rapidly evolving AI landscape, making it difficult to support informed deployment, optimization, and co-design decisions for AI systems.<n>We suggest that benchmarking itself can be framed as an AI task - one in which models are continuously evaluated and optimized across diverse datasets, software, and hardware, using key metrics such as accuracy, latency, throughput, energy consumption, and cost.
arXiv Detail & Related papers (2025-09-14T20:02:15Z)
Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [54.53508601749513]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.<n>To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.<n>Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z)
RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs [38.34856927170692]
We propose a training-free framework for analyzing trained Multimodal Large Language Model (MLLM)<n>It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens.<n>Experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs.
arXiv Detail & Related papers (2025-01-31T11:09:16Z)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL) Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning. We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.