Related papers: A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata

URL: http://arxiv.org/abs/2602.20531v1
Date: Tue, 24 Feb 2026 04:17:50 GMT
Title: A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata
Authors: Azrin Sultana, Firoz Ahmed
Abstract summary: This study proposes a lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
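
The five evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code; the function name `regression_metrics` and the sample ratings are hypothetical and chosen only for illustration. RMSE is by definition the square root of MSE, which matches the reported figures (the square root of 0.0205 is roughly 0.1432).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and Pearson correlation for two equal-length sequences.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    ss_true = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    if ss_true == 0 or ss_pred == 0:
        raise ValueError("R2 and Pearson correlation are undefined for constant inputs")

    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - (mse * n) / ss_true

    # Pearson correlation = covariance / (product of standard deviations)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    pearson = cov / math.sqrt(ss_true * ss_pred)

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "pearson": pearson}


# Made-up app ratings and predictions, purely for demonstration.
ratings = [4.5, 3.8, 4.1, 2.9, 4.7]
predictions = [4.4, 3.9, 4.0, 3.1, 4.6]
metrics = regression_metrics(ratings, predictions)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

R2 and Pearson correlation measure different things: R2 penalizes systematic offsets between predictions and targets, while Pearson correlation only measures linear association, so a model can score a high correlation and still have a lower R2.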

Related papers

Using Vision + Language Models to Predict Item Difficulty [0.0]
We use GPT-4.1-nano to analyze items and generate predictions based on distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224). The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805.
arXiv Detail & Related papers (2026-03-04T23:26:25Z)
FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning [7.78727102442322]
We propose the FC-MIR framework: leveraging sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency. We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions. We deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.
arXiv Detail & Related papers (2025-12-22T07:21:07Z)
One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582]
Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
arXiv Detail & Related papers (2025-11-05T14:39:59Z)
AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
SymCERE: Symmetric Contrastive Learning for Robust Review-Enhanced Recommendation [2.087411180679868]
We propose SymCERE, a contrastive learning method that addresses false negatives in recommendation. Experiments on 15 datasets from three platforms demonstrate that SymCERE outperforms several strong baselines.
arXiv Detail & Related papers (2025-04-03T00:40:09Z)
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
Mind the Gap! Static and Interactive Evaluations of Large Audio Models [55.87220295533817]
Large Audio Models (LAMs) are designed to power voice-native experiences. This study introduces an interactive approach to evaluate LAMs and collect 7,500 LAM interactions from 484 participants.
arXiv Detail & Related papers (2025-02-21T20:29:02Z)
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models. Our framework surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
VMI-VAE: Variational Mutual Information Maximization Framework for VAE With Discrete and Continuous Priors [5.317548969642376]
Variational Autoencoder is a scalable method for learning latent variable models of complex data. We propose a Variational Mutual Information Maximization Framework for VAE to address this issue.
arXiv Detail & Related papers (2020-05-28T12:44:23Z)
AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS). AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence. AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.